
arxiv: 2501.05465 · v2 · pith:2A3QF3NHnew · submitted 2025-01-03 · 💻 cs.CL

Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026)

Pith reviewed 2026-05-23 05:45 UTC · model grok-4.3

classification 💻 cs.CL
keywords small language models · model efficiency · performance comparison · language model survey · parameter scaling · task-specific models

The pith

Small language models with 1 to 8 billion parameters can match or outperform much larger models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey reviews roughly 160 papers to argue that Small Language Models in the 1-8 billion parameter range achieve performance equal to or better than larger models on many tasks. It examines both general-purpose and task-specific SLMs along with training techniques that trade off performance against efficiency, scalability, and cost. The work also introduces the concept of effective sizes to measure real capability gains relative to bigger models. A sympathetic reader would care because this challenges the assumption that bigger is always better and points toward more practical paths for building capable systems.

Core claim

The paper establishes through its survey that a family of SLMs sized 1 to 8 billion parameters can perform as well as or even outperform large models, while defining and characterizing their effective sizes to represent increased capability with respect to LLMs.

What carries the argument

The survey of approximately 160 works on SLMs together with the definition of effective sizes that quantify capability beyond raw parameter count.

If this is right

  • Task-specific SLMs can deliver targeted performance without the overhead of full-scale models.
  • Techniques for creating SLMs allow balancing of accuracy, speed, and resource use.
  • Effective size metrics provide a way to compare models that accounts for real capability rather than parameter count alone.
  • General-purpose SLMs in this range become viable alternatives for many applications.
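
One way to make the effective-size idea concrete is sketched below. It is not the paper's definition, which this page does not reproduce. It assumes that an effective size can be read off a reference curve of benchmark score against parameter count by interpolating linearly in log10(parameters), and that scores rise monotonically with size. The function names and the reference values are hypothetical placeholders chosen for illustration only.

```python
import math

# Placeholder reference curve: (parameters in billions, benchmark score).
# These values are illustrative only and are NOT taken from the paper.
REFERENCE = [(1.0, 40.0), (10.0, 60.0), (100.0, 80.0)]


def effective_size(score, reference):
    """Return the parameter count (in billions) at which the reference
    curve reaches `score`, or None if the score lies outside the curve.

    Interpolates linearly in log10(parameters) between the two reference
    points that bracket the score. Assumes score increases with size.
    """
    points = sorted(reference, key=lambda p: p[1])
    for (n_lo, s_lo), (n_hi, s_hi) in zip(points, points[1:]):
        if s_lo <= score <= s_hi:
            if s_hi == s_lo:
                return n_lo
            t = (score - s_lo) / (s_hi - s_lo)
            log_n = math.log10(n_lo) + t * (math.log10(n_hi) - math.log10(n_lo))
            return 10 ** log_n
    return None


def size_multiplier(actual_params, score, reference=REFERENCE):
    """Ratio of effective size to actual size; None if not computable."""
    eff = effective_size(score, reference)
    return None if eff is None else eff / actual_params
```

Under these assumptions, a multiplier above 1 would mean the model scores like a larger reference model than its raw parameter count suggests. The result depends entirely on which reference curve and benchmark are chosen, which is the comparability concern raised in the referee report.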

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Development effort may shift toward architecture and data choices that maximize performance per parameter rather than raw scale.
  • Smaller models could enable wider deployment on consumer hardware or in low-resource settings.
  • Hybrid systems that combine multiple SLMs might achieve results previously thought to require single large models.

Load-bearing premise

The selected papers are representative of the field and their reported performance numbers are directly comparable across different model scales and evaluation setups.

What would settle it

A controlled head-to-head benchmark that trains and evaluates multiple SLMs and LLMs on exactly the same tasks, with identical data, metrics, and hardware. The core claim would be refuted if larger models came out consistently superior under those conditions.

Figures

Figures reproduced from arXiv: 2501.05465 by Akanksha Gupta, Bijo Thomas, Harshita Asnani, Mecit Gungor, Phanindra Reddy Madduru, Samia Feroze, Shreyas Subramanian, Vikram Elango.

Figure 1. Mind map of topics covered in the paper.
Figure 2. Equivalent sizes of SLMs based on performance benchmarks; more …

Original abstract

As foundation AI models continue to increase in size, an important question arises - is massive scale the only path forward? This survey of about 160 papers presents a family of Small Language Models (SLMs) in the 1 to 8 billion parameter range that demonstrate smaller models can perform as well, or even outperform large models. We explore task agnostic, general purpose SLMs, task-specific SLMs and techniques to create SLMs that can guide the community to build models while balancing performance, efficiency, scalability and cost. Furthermore we define and characterize SLMs' effective sizes, representing increased capability with respect to LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. This survey reviews ~160 papers on Small Language Models (SLMs) in the 1-8B parameter range. It claims that such SLMs can match or outperform larger models on many tasks, examines task-agnostic and task-specific SLMs along with the techniques used to create them, and introduces the notion of 'effective sizes' to characterize SLM capability relative to LLMs.

Significance. If the surveyed results prove representative and comparable, the work would supply concrete counter-evidence to strict scaling hypotheses, supporting research into efficient, lower-cost models and broadening access to capable language technology.

major comments (3)
  1. [Introduction and Survey Scope] The central claim that SLMs can match or exceed LLMs rests on the representativeness of the ~160 selected papers, yet the manuscript provides no explicit inclusion/exclusion criteria, search protocol, or discussion of publication bias (Introduction and Survey Scope sections).
  2. [Performance Comparison and Results] Performance numbers drawn from the cited works are treated as directly comparable, but the text contains no normalization for differences in training data volume, compute budget, benchmark versions, or evaluation protocols, which directly affects the validity of cross-scale claims (Performance Comparison and Results sections).
  3. [Discussion and Limitations] No systematic treatment of counterexamples or negative results is presented; without this, selective highlighting of positive SLM outcomes cannot be ruled out as the driver of the headline conclusion (Discussion and Limitations sections).
minor comments (2)
  1. [Title] The parenthetical '(updated 2026)' in the title is unclear and should be explained or corrected.
  2. [Figures and Tables] Figure captions and table headers would benefit from explicit statements of the exact metrics and model scales being compared.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our survey. We address each major comment below and commit to revisions that strengthen the manuscript's transparency and balance without altering its core contributions.

  1. Referee: The central claim that SLMs can match or exceed LLMs rests on the representativeness of the ~160 selected papers, yet the manuscript provides no explicit inclusion/exclusion criteria, search protocol, or discussion of publication bias (Introduction and Survey Scope sections).

    Authors: We agree that these methodological details were insufficiently explicit. In the revised manuscript, we will add a dedicated subsection under Survey Scope that specifies the search protocol (keywords, databases, date range 2020-2025), inclusion criteria (models strictly 1-8B parameters with reported LLM comparisons on standard benchmarks), exclusion criteria (non-comparative studies, non-English papers, duplicate reports), and a short discussion of publication bias acknowledging that positive results may be over-represented in the literature. revision: yes

  2. Referee: Performance numbers drawn from the cited works are treated as directly comparable, but the text contains no normalization for differences in training data volume, compute budget, benchmark versions, or evaluation protocols, which directly affects the validity of cross-scale claims (Performance Comparison and Results sections).

    Authors: The referee is correct that unnormalized comparisons introduce uncertainty. We will revise the Performance Comparison section to add an explicit limitations paragraph that (a) enumerates the sources of incomparability, (b) reports available training details (data volume, compute) for the most-cited examples, and (c) cautions readers that headline claims are indicative rather than definitive. Full statistical normalization is not feasible within a survey format, so this will be framed as a methodological limitation. revision: partial

  3. Referee: No systematic treatment of counterexamples or negative results is presented; without this, selective highlighting of positive SLM outcomes cannot be ruled out as the driver of the headline conclusion (Discussion and Limitations sections).

    Authors: We accept this criticism. The revised Discussion and Limitations section will contain a new subsection titled 'Counterexamples and Negative Results' that reviews documented cases where SLMs underperform (e.g., long-context reasoning, certain multilingual tasks) and cites papers showing continued benefits from scale. This addition will make the survey more balanced and reduce the risk of perceived selection bias. revision: yes
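
The inclusion and exclusion criteria promised in the first response could be made auditable with a small screening script. The sketch below is only an illustration of that idea. The `Paper` record, its field names, and the `screen` function are hypothetical, and the thresholds simply mirror the criteria stated in the rebuttal (1-8B parameters, 2020-2025, English, comparative, no duplicates).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    """Hypothetical record for one candidate paper."""
    title: str
    params_b: float        # model size in billions of parameters
    year: int
    language: str          # e.g. "en"
    compares_to_llm: bool  # reports a comparison against a larger model


def screen(papers, min_b=1.0, max_b=8.0, years=(2020, 2025)):
    """Apply the stated criteria and return (kept, exclusion_counts).

    Recording a reason for every exclusion makes the selection auditable.
    """
    kept, counts, seen = [], {}, set()
    for p in papers:
        key = p.title.strip().lower()
        if key in seen:
            reason = "duplicate"
        elif not (years[0] <= p.year <= years[1]):
            reason = "outside date range"
        elif p.language != "en":
            reason = "non-English"
        elif not (min_b <= p.params_b <= max_b):
            reason = "outside size range"
        elif not p.compares_to_llm:
            reason = "non-comparative"
        else:
            reason = None
        seen.add(key)
        if reason is None:
            kept.append(p)
        else:
            counts[reason] = counts.get(reason, 0) + 1
    return kept, counts
```

Publishing the exclusion counts alongside the final list would let readers judge how much the filter itself shapes the headline conclusion.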

Circularity Check

0 steps flagged

No circularity: literature survey with no derivations or fitted predictions


This is a survey paper summarizing ~160 existing works on small language models (1-8B parameters). It contains no original mathematical derivations, equations, parameter fittings, uniqueness theorems, or ansatzes that could reduce to self-referential inputs. The central claim is an aggregation of reported results from the surveyed literature rather than a constructed prediction or self-defined quantity. No load-bearing self-citations or renamings of known results are present in a way that matches the enumerated circularity patterns. The paper is self-contained as a review and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature survey with no new mathematical derivations, models, or empirical claims introduced by the authors.

pith-pipeline@v0.9.0 · 5666 in / 910 out tokens · 52483 ms · 2026-05-23T05:45:53.650870+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference

    cs.PF 2025-08 unverdicted novelty 5.0

    ShadowNPU presents shadowAttn, a co-designed sparse attention system that uses NPU pilot compute and techniques like graph bucketing and per-head sparsity to minimize CPU/GPU fallback during on-device LLM inference wh...

  2. Small Language Models are the Future of Agentic AI

    cs.AI 2025-06 unverdicted novelty 5.0

    Small language models are sufficiently capable, more suitable, and far more economical than large models for the repetitive tasks that dominate agentic AI systems.

  3. SLM Finetuning for Natural Language to Domain Specific Code Generation in Production

    cs.LG 2026-04 unverdicted novelty 3.0

    Fine-tuned small language models outperform larger models in natural language to domain-specific code generation with improved performance, latency, and the ability to adapt to customer-specific scenarios without losi...
