Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026)

Akanksha Gupta; Bijo Thomas; Harshita Asnani; Mecit Gungor; Phanindra Reddy Madduru; Samia Feroze; Shreyas Subramanian; Vikram Elango

arxiv: 2501.05465 · v2 · pith:2A3QF3NHnew · submitted 2025-01-03 · 💻 cs.CL

Pith reviewed 2026-05-23 05:45 UTC · model grok-4.3

classification 💻 cs.CL

keywords small language models · model efficiency · performance comparison · language model survey · parameter scaling · task-specific models

The pith

Small language models with 1 to 8 billion parameters can match or outperform much larger models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey reviews roughly 160 papers to argue that Small Language Models in the 1-8 billion parameter range achieve performance equal to or better than larger models on many tasks. It examines both general-purpose and task-specific SLMs along with training techniques that trade off performance against efficiency, scalability, and cost. The work also introduces the concept of effective sizes to measure real capability gains relative to bigger models. A sympathetic reader would care because this challenges the assumption that bigger is always better and points toward more practical paths for building capable systems.

Core claim

The paper establishes through its survey that a family of SLMs sized 1 to 8 billion parameters can perform as well as or even outperform large models, while defining and characterizing their effective sizes to represent increased capability with respect to LLMs.

What carries the argument

The survey of approximately 160 works on SLMs together with the definition of effective sizes that quantify capability beyond raw parameter count.

If this is right

Task-specific SLMs can deliver targeted performance without the overhead of full-scale models.
Techniques for creating SLMs allow balancing of accuracy, speed, and resource use.
Effective size metrics provide a way to compare models that accounts for real capability rather than parameter count alone.
General-purpose SLMs in this range become viable alternatives for many applications.
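
One way to make the effective-size idea concrete is to ask what parameter count a reference model family would need to reach the same benchmark score. The sketch below is an assumption for illustration, not the paper's definition: it interpolates linearly in log-parameter space over a hypothetical reference curve, and every number in it is made up.

```python
import math
from bisect import bisect_left

def effective_size(score, reference):
    """Return the size (billions of parameters) at which a reference family
    would reach `score`, by linear interpolation in log10(params).

    `reference` is a list of (params_in_billions, score) pairs whose scores
    increase strictly with size. Hypothetical definition for illustration.
    """
    ref = sorted(reference)
    sizes = [math.log10(p) for p, _ in ref]
    scores = [s for _, s in ref]
    if any(b <= a for a, b in zip(scores, scores[1:])):
        raise ValueError("reference scores must increase strictly with size")
    # Clamp outside the measured range rather than extrapolate.
    if score <= scores[0]:
        return ref[0][0]
    if score >= scores[-1]:
        return ref[-1][0]
    i = bisect_left(scores, score)
    frac = (score - scores[i - 1]) / (scores[i] - scores[i - 1])
    return 10 ** (sizes[i - 1] + frac * (sizes[i] - sizes[i - 1]))

# Made-up reference curve: (size in billions of parameters, benchmark accuracy).
REFERENCE = [(1.0, 40.0), (10.0, 60.0), (100.0, 80.0)]

# A hypothetical small model scoring 70.0 lands halfway between the 10B and
# 100B reference points in log space.
print(round(effective_size(70.0, REFERENCE), 2))  # prints 31.62
```

Interpolating in log space reflects the common observation that benchmark scores move roughly with the logarithm of model size; a different reference family or benchmark would give a different effective size for the same model.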

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Development effort may shift toward architecture and data choices that maximize performance per parameter rather than raw scale.
Smaller models could enable wider deployment on consumer hardware or in low-resource settings.
Hybrid systems that combine multiple SLMs might achieve results previously thought to require single large models.

Load-bearing premise

The selected papers are representative of the field and their reported performance numbers are directly comparable across different model scales and evaluation setups.

What would settle it

A controlled head-to-head benchmark that trains and evaluates multiple SLMs and LLMs on the exact same tasks with identical data, metrics, and hardware. If larger models came out consistently superior under those conditions, the paper's central claim would not hold.
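
A head-to-head comparison of this kind is usually analysed per item rather than by comparing two headline accuracies. The sketch below is a minimal paired bootstrap, assuming both models were scored on the same items; the function name, the 0/1 correctness lists, and the resample count are illustrative assumptions, not anything the paper specifies.

```python
import random

def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(correct_a) - mean(correct_b).

    Items are resampled jointly, so every draw keeps both models' results
    on the same item.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty, equal-length result lists")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi

# Toy per-item correctness (1 = correct) for a hypothetical SLM and LLM.
slm = [1, 1, 0, 1, 0, 1, 1, 0]
llm = [1, 0, 0, 1, 1, 1, 0, 0]
diff, lo, hi = paired_bootstrap_ci(slm, llm)
print(f"accuracy difference {diff:+.3f}, 95% interval [{lo:+.3f}, {hi:+.3f}]")
```

An interval that straddles zero means the benchmark cannot separate the two models, which is a different finding from either model being better.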

Figures

Figures reproduced from arXiv: 2501.05465 by Akanksha Gupta, Bijo Thomas, Harshita Asnani, Mecit Gungor, Phanindra Reddy Madduru, Samia Feroze, Shreyas Subramanian, Vikram Elango.

**Figure 1.** Mind map of topics covered in the paper.

**Figure 2.** Equivalent sizes of SLMs based on performance benchmarks; more ...

Original abstract

As foundation AI models continue to increase in size, an important question arises - is massive scale the only path forward? This survey of about 160 papers presents a family of Small Language Models (SLMs) in the 1 to 8 billion parameter range that demonstrate smaller models can perform as well, or even outperform large models. We explore task agnostic, general purpose SLMs, task-specific SLMs and techniques to create SLMs that can guide the community to build models while balancing performance, efficiency, scalability and cost. Furthermore we define and characterize SLMs' effective sizes, representing increased capability with respect to LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Survey compiles ~160 SLM papers and defines effective sizes but does not establish selection rules or normalize cross-paper results.

This is a survey that gathers existing work on 1-8B parameter models and claims they can match or beat larger ones. The main point for you is that it functions as an organized reference rather than adding new data or derivations. It breaks the literature into task-agnostic general models, task-specific ones, and the techniques used to build them, then introduces the notion of effective sizes to describe capability relative to bigger models. That framing and the coverage of efficiency-cost tradeoffs could give a reader a quicker map of the area than starting from scratch on arXiv searches. The paper does a reasonable job of pointing to methods that balance performance with resource use.

The soft spot is the central claim. The argument that smaller models can outperform larger ones rests on the ~160 cited papers supplying directly comparable numbers. Without explicit inclusion criteria, discussion of publication bias, or adjustments for differences in training data scale, compute, or benchmarks, it is hard to know whether the highlighted successes reflect real patterns or just non-comparable setups. The abstract gives no sign that these issues were tackled systematically, so that part of the paper stays thin.

This work is for engineers or researchers who want an entry-level overview of small-model techniques before digging into originals. It is not strong enough on its own to settle deployment questions. I would bring it to a reading group only if the group is already focused on efficient models and wants a list of references to check. I would not cite it as evidence for the performance claim. It deserves peer review in a survey-friendly venue, but referees should require a clearer methods section on how papers were chosen and compared.

Referee Report

3 major / 2 minor

Summary. This survey reviews ~160 papers on Small Language Models (SLMs) in the 1-8B parameter range. It claims that such SLMs can match or outperform larger models on tasks, examines task-agnostic and task-specific SLMs plus creation techniques, and introduces the notion of 'effective sizes' that characterize SLM capability relative to LLMs.

Significance. If the surveyed results prove representative and comparable, the work would supply concrete counter-evidence to strict scaling hypotheses, supporting research into efficient, lower-cost models and broadening access to capable language technology.

major comments (3)

[Introduction and Survey Scope] The central claim that SLMs can match or exceed LLMs rests on the representativeness of the ~160 selected papers, yet the manuscript provides no explicit inclusion/exclusion criteria, search protocol, or discussion of publication bias (Introduction and Survey Scope sections).
[Performance Comparison and Results] Performance numbers drawn from the cited works are treated as directly comparable, but the text contains no normalization for differences in training data volume, compute budget, benchmark versions, or evaluation protocols, which directly affects the validity of cross-scale claims (Performance Comparison and Results sections).
[Discussion and Limitations] No systematic treatment of counterexamples or negative results is presented; without this, selective highlighting of positive SLM outcomes cannot be ruled out as the driver of the headline conclusion (Discussion and Limitations sections).
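
The normalization concern in the second major comment can be made concrete. A common way to put results from different papers on a shared axis is to estimate training compute with the widely used approximation C ≈ 6·N·D (N parameters, D training tokens) and compare only models with similar budgets. The sketch below assumes that approximation for dense transformers and uses made-up entries; it is not a procedure the survey performs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReportedResult:
    name: str
    params: float        # parameter count N
    train_tokens: float  # training tokens D
    score: float         # benchmark score as reported by the source paper

def training_flops(r):
    """Approximate training compute for a dense transformer: C ~ 6 * N * D."""
    return 6.0 * r.params * r.train_tokens

def compute_matched_pairs(results, tolerance=2.0):
    """Yield pairs whose estimated compute differs by at most `tolerance`x."""
    ordered = sorted(results, key=training_flops)
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if training_flops(b) / training_flops(a) <= tolerance:
                yield a, b

# Hypothetical entries, not taken from any cited paper.
results = [
    ReportedResult("small-A", 3e9, 2e12, 61.0),
    ReportedResult("large-B", 30e9, 3e11, 63.0),
    ReportedResult("large-C", 70e9, 2e12, 70.0),
]

for a, b in compute_matched_pairs(results):
    print(f"{a.name} vs {b.name}: {a.score} vs {b.score} at similar compute")
```

Matching on compute addresses only one of the listed sources of incomparability; benchmark versions and evaluation protocols would still need to be checked by hand.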

minor comments (2)

[Title] The parenthetical '(updated 2026)' in the title is unclear and should be explained or corrected.
[Figures and Tables] Figure captions and table headers would benefit from explicit statements of the exact metrics and model scales being compared.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our survey. We address each major comment below and commit to revisions that strengthen the manuscript's transparency and balance without altering its core contributions.

Referee: The central claim that SLMs can match or exceed LLMs rests on the representativeness of the ~160 selected papers, yet the manuscript provides no explicit inclusion/exclusion criteria, search protocol, or discussion of publication bias (Introduction and Survey Scope sections).

Authors: We agree that these methodological details were insufficiently explicit. In the revised manuscript, we will add a dedicated subsection under Survey Scope that specifies the search protocol (keywords, databases, date range 2020-2025), inclusion criteria (models strictly 1-8B parameters with reported LLM comparisons on standard benchmarks), exclusion criteria (non-comparative studies, non-English papers, duplicate reports), and a short discussion of publication bias acknowledging that positive results may be over-represented in the literature.

revision: yes

Referee: Performance numbers drawn from the cited works are treated as directly comparable, but the text contains no normalization for differences in training data volume, compute budget, benchmark versions, or evaluation protocols, which directly affects the validity of cross-scale claims (Performance Comparison and Results sections).

Authors: The referee is correct that unnormalized comparisons introduce uncertainty. We will revise the Performance Comparison section to add an explicit limitations paragraph that (a) enumerates the sources of incomparability, (b) reports available training details (data volume, compute) for the most-cited examples, and (c) cautions readers that headline claims are indicative rather than definitive. Full statistical normalization is not feasible within a survey format, so this will be framed as a methodological limitation.

revision: partial

Referee: No systematic treatment of counterexamples or negative results is presented; without this, selective highlighting of positive SLM outcomes cannot be ruled out as the driver of the headline conclusion (Discussion and Limitations sections).

Authors: We accept this criticism. The revised Discussion and Limitations section will contain a new subsection titled 'Counterexamples and Negative Results' that reviews documented cases where SLMs underperform (e.g., long-context reasoning, certain multilingual tasks) and cites papers showing continued benefits from scale. This addition will make the survey more balanced and reduce the risk of perceived selection bias.

revision: yes
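
The inclusion and exclusion rules proposed in the first response are mechanical enough to express as a filter. The sketch below encodes them over a hypothetical `Paper` record; the field names and sample entries are assumptions for illustration, and detecting duplicate reports by normalized title is a deliberate simplification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Paper:
    title: str
    year: int
    params_billions: float
    compares_to_llm: bool
    language: str = "en"

def select_papers(papers, min_b=1.0, max_b=8.0, years=(2020, 2025)):
    """Keep papers on 1-8B models that report an LLM comparison, fall in the
    date range, are written in English, and are not duplicate reports."""
    seen = set()
    kept = []
    for p in papers:
        key = " ".join(p.title.lower().split())  # normalized title
        if not (min_b <= p.params_billions <= max_b):
            continue
        if not p.compares_to_llm or p.language != "en":
            continue
        if not (years[0] <= p.year <= years[1]):
            continue
        if key in seen:
            continue
        seen.add(key)
        kept.append(p)
    return kept

# Hypothetical candidates, one per exclusion rule plus one that passes.
candidates = [
    Paper("Model X 7B report", 2024, 7.0, True),
    Paper("model x  7B Report", 2024, 7.0, True),   # duplicate report
    Paper("Tiny model", 2023, 0.1, True),            # below 1B
    Paper("Model Y 3B", 2019, 3.0, True),            # outside date range
    Paper("Model Z 2B", 2022, 2.0, False),           # no LLM comparison
]
print([p.title for p in select_papers(candidates)])
```

Writing the criteria down this way also makes it easy to report how many candidates each rule removed, which is the kind of detail a search-protocol subsection would normally include.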

Circularity Check

0 steps flagged

No circularity: literature survey with no derivations or fitted predictions

This is a survey paper summarizing ~160 existing works on small language models (1-8B parameters). It contains no original mathematical derivations, equations, parameter fittings, uniqueness theorems, or ansatzes that could reduce to self-referential inputs. The central claim is an aggregation of reported results from the surveyed literature rather than a constructed prediction or self-defined quantity. No load-bearing self-citations or renamings of known results are present in a way that matches the enumerated circularity patterns. The paper is self-contained as a review and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature survey with no new mathematical derivations, models, or empirical claims introduced by the authors.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference
cs.PF · 2025-08 · unverdicted · novelty 5.0

ShadowNPU presents shadowAttn, a co-designed sparse attention system that uses NPU pilot compute and techniques like graph bucketing and per-head sparsity to minimize CPU/GPU fallback during on-device LLM inference wh...

Small Language Models are the Future of Agentic AI
cs.AI · 2025-06 · unverdicted · novelty 5.0

Small language models are sufficiently capable, more suitable, and far more economical than large models for the repetitive tasks that dominate agentic AI systems.

SLM Finetuning for Natural Language to Domain Specific Code Generation in Production
cs.LG · 2026-04 · unverdicted · novelty 3.0

Fine-tuned small language models outperform larger models in natural language to domain-specific code generation with improved performance, latency, and the ability to adapt to customer-specific scenarios without losi...

Reference graph

Works this paper leans on

161 extracted references · 161 canonical work pages · cited by 3 Pith papers · 33 internal anchors

[1]

Mixtral of Experts

Mixtral of experts, 2023, Jiang, Albert Q., Alexandre Sablayrolles, An- toine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot et al. arXiv preprint arXiv:2401.04088 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Openllm leaderboard, huggingface, 2024

work page 2024
[3]

Phi-2: The surprising power of small language models, 2024

work page 2024
[4]

Smollm - blazingly fast and remarkably powerful, 2024

work page 2024
[5]

J., Javaheripi, M., Kauffmann, P., Lee, J

Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gu- nasekar, S., Harrison, M., Hewett, R. J., Javaheripi, M., Kauffmann, P., Lee, J. R., Lee, Y. T., Li, Y., Liu, W., Mendes, C. C. T., Nguyen, A., Price, E., de Rosa, G., Saarikivi, O., Salim, A., Shah, S., Wang, X., Ward, R., Wu, Y., Yu, D., Zhang, C., and Zhang, Y. Phi-4 technical report, 2024

work page 2024
[6]

S., Kalai, T., Wanf, X., Ward, R., Witte, P., Zhang, C., and Zhang, Y

Abdin, M., Aneja, J., Bubeck, S., C ´esar, C., Mendes, T., Chen, W., Del Giorno, Allie abd Eldan, R., Gopi, S., Gunasekar, S., Javaheripi, M., Kauffmann, Piero abd Tat Lee, Y., Li, Yuanzhi ans Nguyen, A., de Rosa, G., Saarikivi, O., Salim, Adil a Shi- tal Shah, S., Santacroce, M., Behl, H. S., Kalai, T., Wanf, X., Ward, R., Witte, P., Zhang, C., and Zhang...

work page
[7]

A., Awan, A

Abdin, M., Jacobs, S. A., Awan, A. A., Aneja, J., Awadallah, A., Awadalla, H., Bach, N., Bahree, A., Bakhtiari, A., Behl, H., Benhaim, A., Bilenko, M., Bjorck, J., Bubeck, S., Cai, M., Mendes, C. C. T., Chen, W., Chaudhary, V., Chopra, P., Giorno, A. D., de Rosa, G., Dixon, M., Eldan, R., Iter, D., Garg, A., Goswami, A., Gunasekar, S., Haider, E., Hao, J....

work page 2024
[8]

Intrinsic dimen- sionality explains the effectiveness of language model fine-tuning

Aghajanyan, A., Zettlemoyer, L., and Gupta, S. Intrinsic dimen- sionality explains the effectiveness of language model fine-tuning. ArXiv abs/2012.13255 (2020). 28

work page arXiv 2012
[9]

Finbert: Financial sentiment analysis with pre-trained lan- guage models, 2019

Araci, D. Finbert: Financial sentiment analysis with pre-trained lan- guage models, 2019

work page 2019
[10]

Armengol-Estap´e, J., Woodruff, J., Cummins, C., and O’Boyle, M. F. P. Slade: A portable small language model decompiler for opti- mized assembly, 2024

work page 2024
[11]

Qwen Technical Report

Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Speculative streaming: Fast llm inference without auxiliary models

Bhendawade, N., Belousova, I., Fu, Q., Mason, H., Rastegari, M., and Najibi, M. Speculative streaming: Fast llm inference without auxiliary models. arXiv preprint arXiv:2402.11131 (2024)

work page arXiv 2024
[13]

PIQA: Reasoning about Physical Commonsense in Natural Language

Bisk, Y., Zellers, R., Bras, R. L., Gao, J., and Choi, Y. Piqa: Reasoning about physical commonsense in natural language. ArXiv abs/1911.11641 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1911
[14]

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J. D., Chen, D., and Dao, T. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Attention fusion: a light yet efficient late fusion mechanism for task adaptation in nlu

Cao, J., Prakash, C., and Hamza, W. Attention fusion: a light yet efficient late fusion mechanism for task adaptation in nlu. InNAACL-HLT (2022)

work page 2022
[16]

Speedupnet: A plug-and-play hyper-network for accelerating text- to-image diffusion models

Chai, W., Zheng, D., Cao, J., Chen, Z., Wang, C., and Ma, C. Speedupnet: A plug-and-play hyper-network for accelerating text- to-image diffusion models. arXiv preprint arXiv:2312.08887 (2023)

work page arXiv 2023
[17]

Parameter-efficient fine-tuning design spaces

Chen, J., Zhang, A., Shi, X., Li, M., Smola, A., and Yang, D. Parameter-efficient fine-tuning design spaces. arXiv preprint arXiv:2301.01821 (2023)

work page arXiv 2023
[18]

Evaluating Large Language Models Trained on Code

Chen, M., Tworek, J., Jun, H., Yuan, Q., Ponde, H., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D. W., Plappert, M., Chantzis, F., ...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[19]

W., Sutton, C., Gehrmann, S., et al

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1–113

work page 2023
[20]

W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N. M., Prabhakaran, V., Reif, E., Du, N., Hutchinson, B. C., Pope, R., Bradbury, J., Austin, J., Isard, M., Gur-Ari, G., Yin, P., Duke, T., ...

work page 2022
[21]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv abs/1803.05457 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[22]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. ArXiv abs/2110.14168 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[23]

Fleurs: Few-shot learning evaluation of universal representations of speech

Conneau, A., Ma, M., Khanuja, S., Zhang, Y., Axelrod, V., Dalmia, S., Riesa, J., Rivera, C., and Bapna, A. Fleurs: Few-shot learning evaluation of universal representations of speech. 2022 IEEE Spoken Language Technology Workshop (SLT) (2022), 798–805

work page 2022
[24]

UltraFeedback: Boosting Language Models with Scaled AI Feedback

Cui, G., Yuan, L., Ding, N., Yao, G., Zhu, W., Ni, Y., Xie, G., Liu, Z., and Sun, M. Ultrafeedback: Boosting language models with high-quality feedback. ArXiv abs/2310.01377 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Chatlaw: Open- source legal large language model with integrated external knowledge bases, 2023

Cui, J., Li, Z., Yan, Y., Chen, B., and Yuan, L. Chatlaw: Open- source legal large language model with integrated external knowledge bases, 2023

work page 2023
[26]

Compacter: Efficient low-rank hypercomplex adapter layers

Davison, J. Compacter: Efficient low-rank hypercomplex adapter layers. In Neural Information Processing Systems (2021)

work page 2021
[27]

S., Chen, Z., Khachane, H., Marshall, W., Pathria, R., Tom, M., and Hestness, J

Dey, N., Gosal, G. S., Chen, Z., Khachane, H., Marshall, W., Pathria, R., Tom, M., and Hestness, J. Cerebras-gpt: Open 30 compute-optimal language models trained on the cerebras wafer-scale clus- ter. ArXiv abs/2304.03208 (2023)

work page arXiv 2023
[28]

Enhancing chat language models by scaling high- quality instructional conversations

Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., Liu, Z., Sun, M., and Zhou, B. Enhancing chat language models by scaling high- quality instructional conversations. In Conference on Empirical Methods in Natural Language Processing (2023)

work page 2023
[29]

S., Liu, S.-Y., Keirsbilck, M

Dong, X., Fu, Y., Diao, S., Byeon, W., Chen, Z., Mahabalesh- warkar, A. S., Liu, S.-Y., Keirsbilck, M. V., Chen, M.-H., Suhara, Y., Lin, Y., Kautz, J., and Molchanov, P. Hymba: A hybrid-head architecture for small language models, 2024

work page 2024
[30]

The Llama 3 Herd of Models

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

P., Clark, J

Edalati, A., Tahaei, M., Kobyzev, I., Nia, V. P., Clark, J. J., and Rezagholizadeh, M. Krona: Parameter efficient tuning with kro- necker adapter. arXiv preprint arXiv:2212.10650 (2022)

work page arXiv 2022
[32]

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

Eldan, R., and Li, Y.-F. Tinystories: How small can language models be and still speak coherent english? ArXiv abs/2305.07759 (2023)

work page internal anchor Pith review arXiv 2023
[33]

Gptq: Accurate post-training quantization for generative pre-trained transform- ers, 2023

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pre-trained transform- ers, 2023

work page 2023
[34]

In International Conference on Machine Learning (2021)

Fu, C., Huang, H., Chen, X., Tian, Y., and Zhao, J.Learn-to-share: A hardware-friendly transfer learning framework exploiting computation and parameter sharing. In International Conference on Machine Learning (2021)

work page 2021
[35]

Break the sequential dependency of llm inference using lookahead decoding

Fu, Y., Bailis, P., Stoica, I., and Zhang, H. Break the sequential dependency of llm inference using lookahead decoding. arXiv preprint arXiv:2402.02057 (2024)

work page arXiv 2024
[36]

Fu, Y., Peng, H., Ou, L., Sabharwal, A., and Khot, T.Specializing smaller language models towards multi-step reasoning, 2023

work page 2023
[37]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Fos- ter, C., Phang, J., He, H., Thite, A., Nabeshima, N., et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020
[38]

Zamba: A compact 7b ssm hybrid model, 2024

Glorioso, P., Anthony, Q., Tokpanov, Y., Whittington, J., Pi- lault, J., Ibrahim, A., and Millidge, B. Zamba: A compact 7b ssm hybrid model, 2024. 31

work page 2024
[39]

J., and Tao, D.Knowledge distillation: A survey

Gou, J., Yu, B., Maybank, S. J., and Tao, D.Knowledge distillation: A survey. International Journal of Computer Vision 129 , 6 (Mar. 2021), 1789–1819

work page 2021
[40]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Gu, A., and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

D., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., Salim, A., Shah, S., Singh Behl, H., Wang, X., Bubeck, S., Eldan, R., Kalai, A

Gunasekar, S., Zhang, Y., Aneja, J., Cesar, C., Mendes, T., Giorno, A. D., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., Salim, A., Shah, S., Singh Behl, H., Wang, X., Bubeck, S., Eldan, R., Kalai, A. T., Lee, Y. T., and Li, Y. Textbooks are all you need, June 2023

work page 2023
[42]

Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Del Giorno, A., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., et al. Textbooks are all you need. arXiv preprint arXiv:2306.11644 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Guo, Z., Wang, P., Wang, Y., and Yu, S. Dr. llama: Improving small language models on pubmedqa via generative data augmentation. ArXiv abs/2305.07804 (2023)

work page arXiv 2023
[44]

Improving small language models on pubmedqa via generative data augmentation

Guo, Z., Wang, P., Wang, Y., and Yu, S. Improving small language models on pubmedqa via generative data augmentation. arXiv, Jul 12 (2023)

work page 2023
[45]

A., Mishra, S., Nakamura, M., Mitra, A., Mashetty, S., and Baral, C

Gupta, H., Sawant, S. A., Mishra, S., Nakamura, M., Mitra, A., Mashetty, S., and Baral, C. Instruction tuned models are quick learners, 2023

work page 2023
[46]

V., Prabhala, H., Paul, S., and Platen, P

Gupta, Y., Jaddipal, V. V., Prabhala, H., Paul, S., and Platen, P. V. Progressive knowledge distillation of stable diffusion xl using layer level loss, 2024

work page 2024
[47]

U., Qureshi, R., Shah, A., Irfan, M., Zafar, A., Shaikh, M

Hadi, M. U., Qureshi, R., Shah, A., Irfan, M., Zafar, A., Shaikh, M. B., Akhtar, N., Wu, J., Mirjalili, S., et al. A survey on large language models: Applications, challenges, limitations, and practical usage. Authorea Preprints (2023)

work page 2023
[48]

Neural machine translation of clinical text: An em- pirical investigation into multilingual pre-trained language models and transfer-learning, 2023

Han, L., Gladkoff, S., Erofeev, G., Sorokina, I., Galiano, B., and Nenadic, G. Neural machine translation of clinical text: An em- pirical investigation into multilingual pre-trained language models and transfer-learning, 2023

work page 2023
[49]

Towards a unified view of parameter-efficient transfer learning

He, J., Zhou, C., Ma, X., Berg-Kirkpatrick, T., and Neubig, G. Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366 (2021). 32

work page arXiv 2021
[50]

Measuring Massive Multitask Language Understanding

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. X., and Steinhardt, J. Measuring massive multitask lan- guage understanding. ArXiv abs/2009.03300 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2009
[51]

Distilling the knowledge in a neural network, 2015

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network, 2015

work page 2015
[52]

Parameter-Efficient Transfer Learning for NLP

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for nlp. ArXiv abs/1902.00751 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1902
[53]

Distilling step- by-step! outperforming larger language models with less training data and smaller model sizes, 2023

Hsieh, C.-Y., Li, C.-L., Yeh, C.-K., Nakhost, H., Fujii, Y., Rat- ner, A., Krishna, R., Lee, C.-Y., and Pfister, T. Distilling step- by-step! outperforming larger language models with less training data and smaller model sizes, 2023

work page 2023
[54]

LoRA: Low-Rank Adaptation of Large Language Models

Hu, J. E., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., and Chen, W. Lora: Low-rank adaptation of large language models. ArXiv abs/2106.09685 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[55]

Lawyer llama technical report, 2023

Huang, Q., Tao, M., Zhang, C., An, Z., Jiang, C., Chen, Z., Wu, Z., and Feng, Y. Lawyer llama technical report, 2023

work page 2023
[56]

How good are low-bit quantized llama3 models? an empirical study, 2024

Huang, W., Ma, X., Qin, H., Zheng, X., Lv, C., Chen, H., Luo, J., Qi, X., Liu, X., and Magno, M. How good are low-bit quantized llama3 models? an empirical study, 2024

work page 2024
[57]

Mistral 7B

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chap- lot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b. arXiv preprint arXiv:2310.06825 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[58]

Model pruning and deploy- ment optimization for ship detection

Jiang, Z., Chen, X., Gu, Y., and An, K. Model pruning and deploy- ment optimization for ship detection. In2023 8th International Conference on Intelligent Computing and Signal Processing (ICSP) (Los Alamitos, CA, USA, apr 2023), IEEE Computer Society, pp. 1961–1968

work page 2023
[59]

PubMedQA: A Dataset for Biomedical Research Question Answering

Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W., and Lu, X. Pub- medqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146 (2019)

work page internal anchor Pith review arXiv 1909
[60]

Flame: A small lan- guage model for spreadsheet formulas, 2023

Joshi, H., Ebenezer, A., Cambronero, J., Gulwani, S., Kanade, A., Le, V., Radi ˇcek, I., and Verbruggen, G. Flame: A small lan- guage model for spreadsheet formulas, 2023

work page 2023
[61]

Prometheus: Inducing fine-grained evaluation capability in language models

Kim, S., Shin, J., Cho, Y., Jang, J., Longpre, S., Lee, H., Yun, S., Shin, S., Kim, S., Thorne, J., et al. Prometheus: Inducing fine-grained evaluation capability in language models. arXiv preprint arXiv:2310.08491 (2023). 33

work page arXiv 2023
[62]

Inducing and exploiting activation sparsity for fast inference on deep neural networks

Kurtz, M., Kopinsky, J., Gelashvili, R., Matveev, A., Carr, J., Goin, M., Leiserson, W., Moore, S., Nell, B., Shavit, N., and Alistarh, D. Inducing and exploiting activation sparsity for fast inference on deep neural networks. InProceedings of the 37th International Conference on Machine Learning (Virtual, 13–18 Jul 2020), H. D. III and A. Singh, Eds., vo...

work page 2020
[63]

Can small language models help large language models reason better?: Lm-guided chain-of-thought, 2024

Lee, J., Yang, F., Tran, T., Hu, Q., Barut, E., Chang, K.-W., and Su, C. Can small language models help large language models reason better?: Lm-guided chain-of-thought, 2024

work page 2024
[64]

H., Kim, G., and Seo, M

Lee, S., Kim, S., Park, S. H., Kim, G., and Seo, M. Prometheus- vision: Vision-language model as a judge for fine-grained evaluation.arXiv preprint arXiv:2401.06591 (2024)

[65]

The power of scale for parameter-efficient prompt tuning

Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning. In Conference on Empirical Methods in Natural Language Processing (2021)

[66]

Fast inference from transformers via speculative decoding

Leviathan, Y., Kalman, M., and Matias, Y. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning (2023), PMLR, pp. 19274–19286

[67]

Symbolic chain-of-thought distillation: Small models can also “think” step-by-step, 2023

Li, L. H., Hessel, J., Yu, Y., Ren, X., Chang, K.-W., and Choi, Y. Symbolic chain-of-thought distillation: Small models can also “think” step-by-step, 2023

[68]

StarCoder: may the source be with you!

Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., Liu, Q., Zheltonozhskii, E., Zhuo, T. Y., Wang, T., Dehaene, O., Davaadorj, M., Lamy-Poirier, J., Monteiro, J., Shliazhko, O., Gontier, N., Meade, N., Zebaze, A., Yee, M.-H., Umapathi, L. K., Zhu, J., Lipkin, B., Oblokulov, M., Wang, Z., Murthy, ...

[69]

Prefix-Tuning: Optimizing Continuous Prompts for Generation

Li, X. L., and Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) abs/2101.00190 (2021)

[70]

Textbooks are all you need ii: phi-1.5 technical report

Li, Y., Bubeck, S., Eldan, R., Giorno, A. D., Gunasekar, S., and Lee, Y. T. Textbooks are all you need ii: phi-1.5 technical report. September 2023

[71]

Large language models in finance: A survey

Li, Y., Wang, S., Ding, H., and Chen, H. Large language models in finance: A survey. In Proceedings of the Fourth ACM International Conference on AI in Finance (2023), pp. 374–382

[72]

Jamba: A hybrid transformer-mamba language model, 2024

Lieber, O., Lenz, B., Bata, H., Cohen, G., Osin, J., Dalmedigos, I., Safahi, E., Meirom, S., Belinkov, Y., Shalev-Shwartz, S., Abend, O., Alon, R., Asida, T., Bergman, A., Glozman, R., Gokhman, M., Manevich, A., Ratner, N., Rozen, N., Shwartz, E., Zusman, M., and Shoham, Y. Jamba: A hybrid transformer-mamba language model, 2024

[73]

Awq: Activation-aware weight quantization for llm compression and acceleration, 2023

Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., Gan, C., and Han, S. Awq: Activation-aware weight quantization for llm compression and acceleration, 2023

[74]

Tinygsm: achieving >80% on gsm8k with small language models

Liu, B., Bubeck, S., Eldan, R., Kulkarni, J., Li, Y., Nguyen, A., Ward, R., and Zhang, Y. Tinygsm: achieving >80% on gsm8k with small language models. ArXiv abs/2312.09241 (2023)

[75]

Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning

Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., and Raffel, C. A. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems 35 (2022), 1950–1965

[76]

Dora: Weight-decomposed low-rank adaptation, 2024

Liu, S.-Y., Wang, C.-Y., Yin, H., Molchanov, P., Wang, Y.-C. F., Cheng, K.-T., and Chen, M.-H. Dora: Weight-decomposed low-rank adaptation, 2024

[77]

Finbert: A pre-trained financial language representation model for financial text mining

Liu, Z., Huang, D., Huang, K., Li, Z., and Zhao, J. Finbert: A pre-trained financial language representation model for financial text mining. In Proceedings of the twenty-ninth international conference on international joint conferences on artificial intelligence (2021), pp. 4513–4519

[78]

The flan collection: Designing data and methods for effective instruction tuning

Longpre, S., Hou, L., Vu, T., Webson, A., Chung, H. W., Tay, Y., Zhou, D., Le, Q. V., Zoph, B., Wei, J., and Roberts, A. The flan collection: Designing data and methods for effective instruction tuning. In International Conference on Machine Learning (2023)

[79]

Blending is all you need: Cheaper, better alternative to trillion-parameters llm

Lu, X., Liusie, A., Raina, V., Zhang, Y., and Beauchamp, W. Blending is all you need: Cheaper, better alternative to trillion-parameters llm

[80]

Exploring small language models with prompt-learning paradigm for efficient domain-specific text classification, 2023

Luo, H., Liu, P., and Esping, S. Exploring small language models with prompt-learning paradigm for efficient domain-specific text classification, 2023

