TinyLlama: An Open-Source Small Language Model
Pith reviewed 2026-05-13 21:05 UTC · model grok-4.3
The pith
A 1.1 billion parameter model pretrained on one trillion tokens outperforms other open-source models of similar size on downstream tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TinyLlama is a 1.1B parameter language model pretrained on approximately 1 trillion tokens using the Llama 2 architecture and tokenizer together with community optimizations including FlashAttention and Lit-GPT. The model achieves better performance across a series of downstream tasks than existing open-source language models of comparable size.
What carries the argument
TinyLlama, the 1.1B parameter model that combines the Llama 2 architecture with pretraining on 1 trillion tokens and open-source efficiency tools.
If this is right
- Model size can be reduced while maintaining competitive task performance when pretraining data reaches the trillion-token scale.
- Open-source efficiency libraries enable practical training runs at the 1B scale.
- Public release of checkpoints supports further fine-tuning and analysis by the community.
- Downstream results improve measurably with increased pretraining tokens even when parameter count stays fixed at 1.1 billion.
Where Pith is reading between the lines
- The result implies that data scaling may substitute for parameter scaling in some performance regimes.
- Hardware-constrained settings could benefit from prioritizing data collection over model enlargement.
- Similar recipes may transfer to other modalities where large corpora exist but compute budgets are limited.
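The data-versus-parameters trade-off discussed above can be made concrete with a tokens-per-parameter ratio. The sketch below uses only the two figures quoted on this page (1.1B parameters, about 1 trillion tokens); the function name `tokens_per_parameter` and the constants are illustrative, not taken from the paper's code.

```python
def tokens_per_parameter(num_tokens: float, num_params: float) -> float:
    """Return the ratio of pretraining tokens to model parameters."""
    if num_params <= 0:
        raise ValueError("num_params must be positive")
    return num_tokens / num_params


# Figures as quoted in the summary above (approximate).
TINYLLAMA_PARAMS = 1.1e9
TINYLLAMA_TOKENS = 1.0e12

ratio = tokens_per_parameter(TINYLLAMA_TOKENS, TINYLLAMA_PARAMS)
print(f"tokens per parameter: {ratio:.0f}")  # prints "tokens per parameter: 909"
```

A ratio of roughly 909 tokens per parameter is far above the commonly cited compute-optimal rule of thumb of about 20 tokens per parameter. That rule of thumb is an outside assumption brought in for comparison, not a number stated on this page.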
Load-bearing premise
The performance edge comes from the scale of pretraining data and chosen optimizations rather than from evaluation setup or unintended overlap between training and test data.
What would settle it
A head-to-head evaluation on the same downstream tasks in which a 1.1B model trained on substantially less data or without the listed optimizations matches or exceeds TinyLlama's scores.
Original abstract
We present TinyLlama, a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention and Lit-GPT), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes. Our model checkpoints and code are publicly available on GitHub at https://github.com/jzhang38/TinyLlama.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents TinyLlama, a 1.1B-parameter language model pretrained on approximately 1 trillion tokens for roughly 3 epochs. It adopts the Llama 2 architecture and tokenizer, incorporates open-source optimizations such as FlashAttention and Lit-GPT for computational efficiency, and reports that the model significantly outperforms other open-source language models of comparable size on downstream tasks. Model checkpoints and training code are released publicly on GitHub.
Significance. If the reported performance margins are robust, the work is significant as a reproducible, openly available small-scale model that demonstrates the viability of achieving competitive downstream results through large-scale pretraining (1T tokens) combined with community-driven efficiency improvements. The public release of code and weights directly supports reproducibility and further research on efficient LLMs.
major comments (2)
- [Abstract and §4 (Evaluation)] Abstract and evaluation sections: the central claim of significant outperformance over comparable open-source models lacks reported error bars, standard deviations across multiple runs, or statistical significance tests on the benchmark scores. Without these, it is impossible to determine whether the observed margins reflect genuine improvements or evaluation variance.
- [§3 and §4] §3 (Training) and §4: the manuscript does not describe explicit data decontamination or contamination checks between the 1T-token pretraining corpus and the downstream evaluation benchmarks. This is load-bearing for the claim that gains arise from architecture, token count, and optimizations rather than test-set overlap.
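To illustrate the first major comment: even with a single training run, the sampling noise of a benchmark score can be estimated from the number of test items. The following is a minimal sketch, assuming accuracy is a mean of independent right/wrong outcomes and treating the two models' scores as independent samples (a simplification, since both models see the same items). The accuracies and item count are hypothetical placeholders, not results from the paper.

```python
import math


def accuracy_standard_error(accuracy: float, num_items: int) -> float:
    """Binomial standard error of an accuracy measured on num_items examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_items)


def margin_z_score(acc_a: float, acc_b: float, num_items: int) -> float:
    """z-score of the gap between two accuracies on a benchmark of num_items.

    Assumes independent samples; a paired test on per-item outcomes
    would usually be more sensitive.
    """
    se_a = accuracy_standard_error(acc_a, num_items)
    se_b = accuracy_standard_error(acc_b, num_items)
    return (acc_a - acc_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# Hypothetical numbers for illustration only.
z = margin_z_score(acc_a=0.60, acc_b=0.58, num_items=1000)
print(f"z = {z:.2f}, significant at 5%: {abs(z) > 1.96}")
```

With these placeholder values, a two-point gap on 1,000 items falls below the usual 1.96 threshold, which is the kind of ambiguity the referee is pointing to.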
minor comments (2)
- [Results tables] Table 1 or equivalent results table: include the exact number of tokens seen per model and the precise evaluation harness (e.g., prompting format, few-shot settings) used for every baseline to enable direct replication.
- [§2] §2 (Related Work): add a brief comparison of training FLOPs or wall-clock time against the strongest baselines to quantify the claimed computational-efficiency gains from FlashAttention and Lit-GPT.
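For the FLOPs comparison requested in the second minor comment, a rough starting point is the widely used approximation that training compute is about 6 × parameters × tokens. The sketch below applies it to the figures quoted on this page. The approximation is an assumption introduced here, and it ignores attention cost and hardware utilization, so it says nothing about the wall-clock gains attributed to FlashAttention or Lit-GPT.

```python
def approx_training_flops(num_params: float, num_tokens: float) -> float:
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6.0 * num_params * num_tokens


# Figures as quoted on this page (approximate).
tinyllama_flops = approx_training_flops(num_params=1.1e9, num_tokens=1.0e12)
print(f"approximate training compute: {tinyllama_flops:.2e} FLOPs")
```

Running the same function on each baseline's parameter and token counts, once those are reported, would give the like-for-like comparison the referee asks for.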
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive recommendation. We address each major comment below and have revised the manuscript accordingly where feasible.
- Referee: [Abstract and §4 (Evaluation)] Abstract and evaluation sections: the central claim of significant outperformance over comparable open-source models lacks reported error bars, standard deviations across multiple runs, or statistical significance tests on the benchmark scores. Without these, it is impossible to determine whether the observed margins reflect genuine improvements or evaluation variance.
  Authors: We agree that error bars and statistical tests would strengthen the evaluation. However, pretraining a 1.1B model on 1T tokens required substantial compute, and we performed only a single training run. In the revised manuscript we have added an explicit statement in Section 4 noting that all reported results are from this single run and discussing the implications for variance. We also reference prior work on LLM evaluation stability. While we cannot rerun the full pretraining to obtain multiple seeds, the consistent gains across diverse benchmarks support the robustness of the findings.
  Revision: partial
- Referee: [§3 and §4] §3 (Training) and §4: the manuscript does not describe explicit data decontamination or contamination checks between the 1T-token pretraining corpus and the downstream evaluation benchmarks. This is load-bearing for the claim that gains arise from architecture, token count, and optimizations rather than test-set overlap.
  Authors: We acknowledge the importance of this point. The original submission did not detail explicit decontamination checks. The pretraining corpus is SlimPajama, which applies its own deduplication and filtering. In the revised manuscript we have expanded Section 3 with additional description of the data sources and preprocessing pipeline, and added a paragraph in Section 4 explicitly noting the absence of benchmark-specific contamination analysis as a limitation. We believe the scale of training and the nature of the cleaned dataset make substantial leakage unlikely, but we agree this should be stated clearly.
  Revision: partial
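The contamination check that the rebuttal lists as missing is often approximated by looking for long word n-grams shared between benchmark items and the pretraining corpus. The sketch below is one simple way to do that and is not the authors' method. The whitespace tokenization, the window size `n=8`, and the toy strings are all assumptions chosen for illustration; a real check over a corpus of this size would need a streaming or hashed index.

```python
def word_ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams in text (whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_fraction(benchmark_items, corpus_docs, n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_ngrams = set()
    for doc in corpus_docs:
        corpus_ngrams |= word_ngrams(doc, n)
    if not benchmark_items:
        return 0.0
    flagged = sum(
        1 for item in benchmark_items if word_ngrams(item, n) & corpus_ngrams
    )
    return flagged / len(benchmark_items)


# Toy data for illustration only.
corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
items = [
    "Question: the quick brown fox jumps over the lazy dog near what?",
    "what is the capital of france and why is it famous",
]
print(contaminated_fraction(items, corpus))  # prints 0.5
```

Reporting this fraction per benchmark, together with scores recomputed on the unflagged items, would address the referee's concern directly.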
Circularity Check
No circularity: empirical training results on external benchmarks
The paper reports an empirical outcome: pretraining a 1.1B model on ~1T tokens using the Llama 2 architecture plus open-source optimizations (FlashAttention, Lit-GPT), then measuring downstream task performance against existing open-source baselines. No equations, derivations, fitted parameters, or predictions appear; the central claim is a direct train-and-evaluate result. Self-citations are limited to the external Llama 2 paper and standard tools; none function as load-bearing uniqueness theorems or ansatzes that reduce the result to the authors' prior inputs. The work is therefore self-contained against external benchmarks with no reduction by construction.
Axiom & Free-Parameter Ledger
free parameters (3)
- model parameter count
- pretraining token count
- training epochs
axioms (2)
- domain assumption: the Llama 2 architecture and tokenizer are a suitable base for small-scale pretraining
- domain assumption: open-source efficiency tools (FlashAttention, Lit-GPT) preserve model quality while improving speed
Forward citations
Cited by 29 Pith papers
- HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
  HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.
- Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm
  A framework to identify and convert foldable layer normalizations to RMSNorm for exact equivalence and faster inference in deep neural networks.
- When the Ruler is Broken: Parsing-Induced Suppression in LLM-Based Security Log Evaluation
  Strict regex parsing of LLM security log outputs introduces systematic errors that can make functional models appear non-functional, with a 76-point accuracy gap recovered by fuzzy parsing.
- Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge
  Tempus delivers 607 GOPS at 10.677 W using fixed 16 AIE cores on Versal AI Edge, with 211.2x better platform-aware utility than spatial SOTA ARIES and zero URAM/DSP utilization.
- BoostTaxo: Zero-Shot Taxonomy Induction via Boosting-Style Agentic Reasoning and Constraint-Aware Calibration
  BoostTaxo introduces a boosting-style LLM framework for zero-shot taxonomy induction that uses hybrid candidate selection and constraint-aware calibration to achieve superior or comparable performance to prior methods...
- SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning
  SpeakerLLM unifies speaker profiling, recording-condition understanding, and structured verification reasoning in an audio-LLM via a hierarchical tokenizer and decision traces.
- PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks
  PASA is a semantic-level watermarking method for LLM text that uses embedding-space clusters and synchronized randomness to remain detectable after paraphrasing while preserving text quality.
- Common-agency Games for Multi-Objective Test-Time Alignment
  CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.
- UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
  A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
- HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
  HeadQ reduces 84-94% of excess perplexity in 2-bit key quantization by adding low-rank logit corrections in a calibration-learned query basis, with further gains from an A^2-weighted value policy.
- Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery
  Sentinel-VLA introduces a metacognitive VLA model with a sentinel module for real-time status monitoring, dynamic reasoning, and error recovery, plus a self-evolving continual learning method, raising real-world task ...
- Beyond Fine-Tuning: In-Context Learning and Chain-of-Thought for Reasoned Distractor Generation
  LLMs prompted with few-shot examples and rationales generate better reasoned distractors for MCQs than fine-tuned contrastive models across six benchmarks.
- StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models
  StoSignSGD resolves SignSGD divergence on non-smooth objectives via structural stochasticity, matching optimal convex rates and improving non-convex bounds while delivering 1.44-2.14x speedups in FP8 LLM pretraining.
- EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models
  A CIM-based hardware-software co-design in 65nm achieves up to 7.3x higher throughput and 49.59x better energy efficiency than NVIDIA Orin Nano for LLaMA3.2-1B, averaging 336 tokens/s and 173 tokens/J under INT4 acros...
- A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring
  A physical agentic loop with execution-state monitoring improves robustness of language-guided grasping over open-loop execution by converting noisy telemetry into discrete outcome events that trigger retries or user ...
- RAGShield: Detecting Numerical Claim Manipulation in Government RAG Systems
  RAGShield detects all numerical manipulations in government RAG systems via pattern-based value extraction and cross-source verification, achieving 0% attack success rate on 430 real IRS-derived attacks where embeddin...
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
  MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
  EAGLE resolves feature-level uncertainty in speculative sampling via one-step token advancement, delivering 2.7x-3.5x speedup on LLaMA2-Chat 70B and doubled throughput across multiple model families and tasks.
- DP-LAC: Lightweight Adaptive Clipping for Differentially Private Federated Fine-tuning of Language Models
  DP-LAC provides a new adaptive clipping technique for DP-SGD in federated LLM fine-tuning that improves accuracy by 6.6% on average without consuming additional privacy budget or requiring new hyperparameters.
- TabEmb: Joint Semantic-Structure Embedding for Table Annotation
  TabEmb decouples LLM-based semantic column embeddings from graph-based structural modeling to produce joint representations that improve table annotation tasks.
- Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
  ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.
- Acceptance Dynamics Across Cognitive Domains in Speculative Decoding
  Empirical measurements across four NLP domains show task type is a stronger predictor of speculative decoding acceptance than tree depth, with chat uniquely achieving expected accepted length over 1 token per step.
- VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use
  VectraYX-Nano is a 42M-parameter Spanish cybersecurity LLM trained with curriculum learning and native MCP tool use, achieving 0.78 conversational gate and improved tool selection with denser data.
- Agentic Performance at the Edge: Insights from Benchmarking
  Edge agentic AI quality is not a simple function of model size; robust results require joint design of model selection and tool integration, as revealed by domain-conditioned benchmarks showing accuracy-latency Pareto fronts.
- OpenSOC-AI: Democratizing Security Operations with Parameter Efficient LLM Log Analysis
  LoRA fine-tuning of TinyLlama-1.1B on 450 SOC examples produces 68% threat classification accuracy and 58% severity accuracy on 50 held-out logs, with full code, weights, and data released.
- DP-FlogTinyLLM: Differentially private federated log anomaly detection using Tiny LLMs
  DP-FLogTinyLLM combines federated learning, differential privacy, and LoRA-tuned tiny LLMs to match centralized log anomaly detection performance on Thunderbird and BGL datasets while preserving privacy.
- An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models
  Data-influence-score filtering using validation-set loss on downstream coding tasks improves Code-LLM performance, with the most beneficial training data varying significantly across different programming tasks.
- Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
  A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.
- ECG Foundation Models and Medical LLMs for Agentic Cardiovascular Intelligence at the Edge: A Review and Outlook
  ECG foundation models for signal interpretation and medical LLMs for reasoning can be integrated into agentic systems for real-time cardiovascular intelligence on edge devices.