pith. machine review for the scientific record.

arxiv: 2402.06196 · v3 · submitted 2024-02-09 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

Large Language Models: A Survey

Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, Jianfeng Gao

Pith reviewed 2026-05-11 15:17 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI
keywords large language models · survey · GPT · LLaMA · PaLM · scaling laws · benchmarks · evaluation

The pith

Large language models acquire general-purpose language understanding and generation by training billions of parameters on massive text data, as predicted by scaling laws.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews how large language models develop broad capabilities in language understanding and generation through training on enormous text collections. It focuses on three major model families—GPT, LLaMA, and PaLM—while outlining their distinct features, advances, and shortcomings. The survey also covers methods for building and improving these models, the datasets used for training and testing, standard evaluation metrics, and comparative performance results across representative tasks. It closes by noting current limitations and directions for further work in this fast-moving area.

Core claim

LLMs acquire their general-purpose language understanding and generation abilities by training billions of parameters on massive amounts of text data, as predicted by scaling laws. The paper surveys prominent models from the GPT, LLaMA, and PaLM families, discusses techniques for constructing and augmenting LLMs, reviews training and evaluation datasets along with common metrics, compares performance on benchmarks, and identifies open challenges.

What carries the argument

Scaling laws relating model performance to parameter count and training data volume, which the paper uses to frame the review of LLM families and their development.
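As a rough illustration of the relation the survey leans on, the sketch below evaluates the parametric loss form L(N, D) = E + A/N^α + B/D^β fitted by Hoffmann et al. (2022) (reference [2] below). The constants are the approximate values reported in that paper, not numbers taken from this survey, and are used only to show how predicted loss falls as parameter count and training tokens grow.

```python
# Minimal sketch, assuming the Hoffmann et al. (2022) parametric fit
# L(N, D) = E + A / N**alpha + B / D**beta, with N = parameter count and
# D = training tokens. Constants are the approximate reported values and
# are illustrative only.

def predicted_loss(n_params: float, n_tokens: float,
                   E: float = 1.69, A: float = 406.4, B: float = 410.7,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predicted pre-training loss for a model of n_params trained on n_tokens."""
    return E + A / n_params ** alpha + B / n_tokens ** beta


if __name__ == "__main__":
    # Hypothetical configurations: loss decreases smoothly with model and data scale.
    for n, d in [(7e9, 1.4e12), (70e9, 1.4e12), (70e9, 14e12)]:
        print(f"N={n:.0e} params, D={d:.0e} tokens -> predicted loss {predicted_loss(n, d):.3f}")
```

Under this form, performance improves predictably with both parameter count and data volume, which is the framing the survey uses to organize the GPT, LLaMA, and PaLM families.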

If this is right

  • Techniques for building and augmenting LLMs can be applied to improve performance on specific downstream tasks.
  • Benchmark comparisons highlight which model families excel in particular areas of language processing.
  • Discussion of limitations points to concrete areas where future model development should focus.
  • Overview of datasets and metrics provides a basis for consistent evaluation across new models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The rapid changes in the field may require periodic updates to the survey to maintain relevance for practitioners.
  • Insights on model limitations could guide efforts to create more efficient versions that use fewer resources while retaining capabilities.
  • Connections between scaling and emergent abilities suggest testing whether further increases in size produce qualitatively new behaviors beyond current benchmarks.

Load-bearing premise

The selection of prominent LLMs and representative benchmarks accurately reflects the field's current state without major omissions or bias.

What would settle it

A new model family or benchmark set that was omitted from the survey but shows substantially different performance patterns or violates the scaling predictions on the same tasks.
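A minimal sketch of how such a check could be run, assuming one has (parameter count, token count, evaluation loss) triples for the omitted family. The data points and the 0.1-loss tolerance below are hypothetical, and the fitted constants are again the approximate Hoffmann et al. (2022) values rather than anything reported in this survey.

```python
# Illustrative check with made-up numbers: compare a new model family's reported
# losses against the Hoffmann et al. (2022) parametric fit. Points that sit far
# off this curve on the same tasks would be the counter-evidence described above.

def predicted_loss(n_params, n_tokens, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    return E + A / n_params ** alpha + B / n_tokens ** beta

# (N parameters, D tokens, observed eval loss) -- hypothetical values for illustration.
observations = [
    (7e9, 2.0e12, 2.02),
    (34e9, 2.0e12, 1.95),
]

TOLERANCE = 0.1  # arbitrary illustrative threshold

for n, d, observed in observations:
    expected = predicted_loss(n, d)
    verdict = "consistent" if abs(observed - expected) < TOLERANCE else "deviates"
    print(f"N={n:.0e} D={d:.0e}: predicted {expected:.2f}, observed {observed:.2f} -> {verdict}")
```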

Original abstract

Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the release of ChatGPT in November 2022. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive amounts of text data, as predicted by scaling laws \cite{kaplan2020scaling,hoffmann2022training}. The research area of LLMs, while very recent, is evolving rapidly in many different ways. In this paper, we review some of the most prominent LLMs, including three popular LLM families (GPT, LLaMA, PaLM), and discuss their characteristics, contributions and limitations. We also give an overview of techniques developed to build, and augment LLMs. We then survey popular datasets prepared for LLM training, fine-tuning, and evaluation, review widely used LLM evaluation metrics, and compare the performance of several popular LLMs on a set of representative benchmarks. Finally, we conclude the paper by discussing open challenges and future research directions.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript surveys large language models (LLMs), noting that their general-purpose language understanding and generation capabilities arise from training billions of parameters on massive text corpora in line with scaling laws. It reviews prominent LLM families (GPT, LLaMA, PaLM), their characteristics, contributions, and limitations; overviews techniques for building and augmenting LLMs; surveys datasets for training, fine-tuning, and evaluation; reviews evaluation metrics; compares several LLMs on representative benchmarks; and discusses open challenges and future directions.

Significance. If the summaries remain faithful to the cited sources, the survey provides a structured entry point into the post-ChatGPT LLM literature. Its value lies in consolidating model families, techniques, datasets, metrics, and benchmark results into one document, which can help researchers track the field's rapid evolution without needing to consult dozens of primary papers. The explicit linkage to scaling laws and the inclusion of performance comparisons add practical utility for both newcomers and specialists.

major comments (1)
  1. [Abstract and Introduction] Abstract and §1 (Introduction): the selection of 'some of the most prominent LLMs' and the specific families (GPT, LLaMA, PaLM) plus benchmarks is presented without explicit inclusion/exclusion criteria or a justification of coverage breadth. This choice directly affects the survey's representativeness and risks author-specific bias, which is load-bearing for a descriptive review whose central contribution is organizational completeness.
minor comments (3)
  1. [References] Ensure that all cited works (e.g., Kaplan et al. 2020, Hoffmann et al. 2022) are listed in the bibliography with complete and consistent formatting, including arXiv identifiers or DOIs where applicable.
  2. [Evaluation section] Benchmark comparison tables would benefit from an explicit statement of the evaluation date or model versions used, given the rapid release cadence of new LLMs.
  3. [Figures and Tables] Figure captions and table legends should be expanded to be self-contained, specifying what each column/row represents without requiring reference to the main text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and the recommendation for minor revision. The feedback on selection criteria is constructive and we address it directly below.

point-by-point responses
  1. Referee: [Abstract and Introduction] Abstract and §1 (Introduction): the selection of 'some of the most prominent LLMs' and the specific families (GPT, LLaMA, PaLM) plus benchmarks is presented without explicit inclusion/exclusion criteria or a justification of coverage breadth. This choice directly affects the survey's representativeness and risks author-specific bias, which is load-bearing for a descriptive review whose central contribution is organizational completeness.

    Authors: We agree that the absence of explicit inclusion/exclusion criteria reduces transparency. In the revised version we will insert a new paragraph at the end of §1 that states our selection rationale: we focus on three families that (i) exemplify distinct development paradigms (closed-source scaling in GPT, open-source accessibility in LLaMA, and efficient large-scale training in PaLM), (ii) have been cited extensively in the post-ChatGPT literature, and (iii) together cover the dominant architectural and training choices discussed in the survey. Benchmarks were chosen as those most frequently reported across the cited primary papers for core capabilities (reasoning, knowledge, instruction following). We explicitly note that the survey is not exhaustive and that many other models exist; the chosen set is intended to illustrate representative trends rather than to claim completeness. This addition directly mitigates the risk of perceived author-specific bias while preserving the survey's scope. revision: yes

Circularity Check

0 steps flagged

No significant circularity: survey of external literature only

full rationale

This paper is explicitly a survey that organizes and summarizes existing external work on LLMs (GPT, LLaMA, PaLM families), techniques, datasets, metrics, and benchmarks. Its central statements cite scaling laws to Kaplan et al. (2020) and Hoffmann et al. (2022) with no self-citation load-bearing on any claim. No equations, new predictions, fitted parameters, or derivations appear; the text frames all content as review rather than novel technical assertion. No step reduces by construction to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, it relies entirely on cited prior literature for all content; no new free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5499 in / 1126 out tokens · 60744 ms · 2026-05-11T15:17:37.767830+00:00 · methodology


Forward citations

Cited by 35 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Variance-aware Reward Modeling with Anchor Guidance

    stat.ML 2026-05 unverdicted novelty 7.0

    Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, ...

  2. Logic-Regularized Verifier Elicits Reasoning from LLMs

    cs.CL 2026-05 unverdicted novelty 7.0

    LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.

  3. PrivacyAssist: A User-Centric Agent Framework for Detecting Privacy Inconsistencies in Android Apps

    cs.CR 2026-04 unverdicted novelty 7.0

    PrivacyAssist uses multi-agent LLMs and RAG to detect mismatches between Android app permissions and declared data practices, finding only 16% of 2,347 apps fully consistent.

  4. Masked-Token Prediction for Anomaly Detection at the Large Hadron Collider

    hep-ph 2026-04 unverdicted novelty 7.0

    The work demonstrates masked-token prediction with transformers for model-independent anomaly detection in LHC data, achieving strong results on top-rich BSM signatures like four-top production using VQ-VAE tokenization.

  5. SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.

  6. Cross-Modal Bayesian Low-Rank Adaptation for Uncertainty-Aware Multimodal Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    CALIBER conditions the variational posterior of low-rank adapters on token-level cross-attention between text and audio to produce uncertainty-aware multimodal parameter-efficient fine-tuning.

  7. NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions

    cs.DB 2026-04 conditional novelty 7.0

    NL2SQLBench is a new modular benchmarking framework that evaluates LLM NL2SQL methods across three core modules on existing datasets, exposing large accuracy gaps and computational inefficiency.

  8. Unified Compression Algorithm for Distributed Nonconvex Optimization: Generalized to 1-Bit, Saturation, and Bounded Noise

    math.OC 2026-04 unverdicted novelty 7.0

    A unified compression algorithm for distributed nonconvex optimization achieves O(1/sqrt(T)) convergence for locally-bounded compressors, matching centralized 1-bit methods, with an improved O(1/T^{2/3}) rate after on...

  9. Mechanism Design for Quality-Preserving LLM Advertising

    cs.GT 2026-05 unverdicted novelty 6.0

    A quality-preserving auction framework for LLM advertising uses RAG-based endogenous reserves and KL-regularized or screened VCG mechanisms to achieve DSIC, IR, higher revenue, and better semantic fidelity than baselines.

  10. Continuous Latent Diffusion Language Model

    cs.CL 2026-05 unverdicted novelty 6.0

    Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...

  11. Decision-aware User Simulation Agent for Evaluating Conversational Recommender Systems

    cs.IR 2026-05 unverdicted novelty 6.0

    Hesitator is a theory-grounded simulator that separates utility-based item selection from overload-aware commitment decisions to reduce unrealistic high acceptance rates in conversational recommender evaluations.

  12. FitText: Evolving Agent Tool Ecologies via Memetic Retrieval

    cs.AI 2026-05 unverdicted novelty 6.0

    FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.

  13. OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens

    q-bio.NC 2026-04 unverdicted novelty 6.0

    OmniMouse demonstrates data-driven scaling in multi-task brain models on a 150B-token neural dataset, achieving SOTA across prediction, decoding, and forecasting while model size gains saturate.

  14. Beyond Feature Fusion: Contextual Bayesian PEFT for Multimodal Uncertainty Estimation

    cs.LG 2026-04 unverdicted novelty 6.0

    CoCo-LoRA uses audio context to modulate uncertainty in Bayesian low-rank adapters for multimodal text tasks, offering a lightweight alternative to feature fusion that matches or exceeds baselines.

  15. ReRec: Reasoning-Augmented LLM-based Recommendation Assistant via Reinforcement Fine-tuning

    cs.IR 2026-04 unverdicted novelty 6.0

    ReRec uses reinforcement fine-tuning with dual-graph reward shaping, reasoning-aware advantage estimation, and online curriculum scheduling to improve LLM reasoning and performance in recommendation tasks.

  16. PRISM-MCTS: Learning from Reasoning Trajectories with Metacognitive Reflection

    cs.AI 2026-04 unverdicted novelty 6.0

    PRISM-MCTS improves MCTS-based reasoning efficiency by maintaining a shared memory of heuristics and fallacies reinforced by a process reward model, halving required trajectories on GPQA while outperforming prior methods.

  17. SAM 3D: 3Dfy Anything in Images

    cs.CV 2025-11 unverdicted novelty 6.0

    SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose using human-model annotated data at scale and synthetic-to-real training, achieving 5:1 human preference wins.

  18. Context Convergence Improves Answering Inferential Questions

    cs.CL 2026-05 unverdicted novelty 5.0

    Passages made from high-convergence sentences improve LLM performance on inferential questions compared to cosine similarity selection.

  19. Annotation Quality in Aspect-Based Sentiment Analysis: A Case Study Comparing Experts, Students, Crowdworkers, and Large Language Model

    cs.CL 2026-05 unverdicted novelty 5.0

    Expert re-annotations of a German ABSA dataset serve as ground truth to evaluate how students, crowdworkers, and LLMs affect inter-annotator agreement and downstream performance on ACSA and TASD tasks using BERT, T5, ...

  20. Revisiting General Map Search via Generative Point-of-Interest Retrieval

    cs.IR 2026-05 unverdicted novelty 5.0

    GenPOI is a generative POI retrieval system that unifies heterogeneous contexts via LLMs, uses geo-semantic tokenization, and applies proximity constraints to achieve superior performance on large-scale map search data.

  21. Weight Pruning Amplifies Bias: A Multi-Method Study of Compressed LLMs for Edge AI

    cs.LG 2026-05 conditional novelty 5.0

    Activation-aware pruning preserves perplexity but amplifies bias in LLMs, with 47-59% of previously neutral items developing new stereotypical responses at 70% sparsity.

  22. A Survey on Split Learning for LLM Fine-Tuning: Models, Systems, and Privacy Optimizations

    cs.CR 2026-04 unverdicted novelty 5.0

    A survey that introduces a unified training pipeline and taxonomizes split learning approaches for LLM fine-tuning across model, system, and privacy dimensions.

  23. Automated LTL Specification Generation from Industrial Aerospace Requirements

    cs.SE 2026-04 unverdicted novelty 5.0

    AeroReq2LTL automates LTL generation from industrial aerospace requirements via LLMs with a data dictionary and templates, achieving 85% precision and 88% recall on real data.

  24. Calibrating Model-Based Evaluation Metrics for Summarization

    cs.CL 2026-04 unverdicted novelty 5.0

    A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.

  25. On the Representational Limits of Quantum-Inspired 1024-D Document Embeddings: An Experimental Evaluation Framework

    cs.IR 2026-04 unverdicted novelty 5.0

    Quantum-inspired 1024-D document embeddings exhibit weak, unstable ranking performance and structural geometric limitations, performing better as auxiliary components in hybrid lexical-embedding retrieval systems.

  26. The Role of Emotional Stimuli and Intensity in Shaping Large Language Model Behavior

    cs.LG 2026-04 unverdicted novelty 5.0

    Positive emotional prompts improve LLM accuracy and reduce toxicity but increase sycophantic agreement, while negative emotions show the reverse pattern.

  27. Addressing Data Scarcity in Bangla Fake News Detection: An LLM-Based Dataset Augmentation Approach

    cs.CL 2026-05 unverdicted novelty 4.0

    LLM-based augmentation of the minority class in a Bangla fake news dataset, using high rates and random subsampling, improves F1 score from 0.85 to 0.88.

  28. LLM-Enhanced Topical Trend Detection at Snapchat

    cs.IR 2026-04 unverdicted novelty 4.0

    Snapchat's deployed system detects emerging topical trends in short videos via multimodal extraction, time-series burst detection, and LLM consolidation, achieving high precision per six months of human evaluation and...

  29. An End-to-End Ukrainian RAG for Local Deployment. Optimized Hybrid Search and Lightweight Generation

    cs.CL 2026-04 unverdicted novelty 4.0

    A two-stage hybrid search pipeline paired with a synthetic-data fine-tuned and compressed Ukrainian language model delivers competitive local question answering under strict compute limits.

  30. Enhancing Mental Health Counseling Support in Bangladesh using Culturally-Grounded Knowledge

    cs.AI 2026-04 unverdicted novelty 4.0

    A clinically validated knowledge graph built for Bangladeshi stressors and interventions improves LLM counseling responses over standard RAG in contextual relevance and clinical appropriateness.

  31. Network Effects and Agreement Drift in LLM Debates

    cs.SI 2026-04 unverdicted novelty 4.0

    LLM agents in controlled network debates show agreement drift toward specific opinion positions, requiring separation of structural effects from LLM biases before using them as human behavioral proxies.

  32. Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models

    cs.RO 2026-04 unverdicted novelty 4.0

    This survey organizes aerial vision-language navigation methods into five architectural categories, critically reviews evaluation infrastructure, and synthesizes seven open problems for LLM/VLM integration.

  33. Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning

    cs.CL 2026-04 unverdicted novelty 4.0

    EduQwen 32B models optimized via RL then SFT set new SOTA on the Cross-Domain Pedagogical Knowledge Benchmark and surpass Gemini-3 Pro.

  34. Materials Informatics Across the Length Scales

    cond-mat.mtrl-sci 2026-04 unverdicted novelty 2.0

    A survey of data-driven methods for materials modeling at nanoscale, mesoscale, and micro-to-continuum scales that identifies established capabilities, data quality issues, and obstacles to cross-scale integration.

  35. Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems

    eess.SY 2026-04 unverdicted novelty 2.0

    A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.

Reference graph

Works this paper leans on

247 extracted references · 247 canonical work pages · cited by 35 Pith papers · 61 internal anchors

  1. [1]

    Scaling Laws for Neural Language Models

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361 , 2020

  2. [2]

    Training Compute-Optimal Large Language Models

    J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark et al. , “Training compute-optimal large language models,” arXiv preprint arXiv:2203.15556, 2022

  3. [3]

    Prediction and entropy of printed english,

    C. E. Shannon, “Prediction and entropy of printed english,” Bell system technical journal, vol. 30, no. 1, pp. 50–64, 1951

  4. [4]

Statistical methods for speech recognition

    F. Jelinek, Statistical methods for speech recognition . MIT press, 1998

  5. [5]

Foundations of statistical natural language processing

    C. Manning and H. Schutze, Foundations of statistical natural lan- guage processing. MIT press, 1999

  6. [6]

    C. D. Manning, An introduction to information retrieval . Cambridge university press, 2009

  7. [7]

    A Survey of Large Language Models

    W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y . Hou, Y . Min, B. Zhang, J. Zhang, Z. Dong et al. , “A survey of large language models,” arXiv preprint arXiv:2303.18223 , 2023

  8. [8]

A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT

    C. Zhou, Q. Li, C. Li, J. Yu, Y . Liu, G. Wang, K. Zhang, C. Ji, Q. Yan, L. He et al., “A comprehensive survey on pretrained foundation mod- els: A history from bert to chatgpt,” arXiv preprint arXiv:2302.09419, 2023

  9. [9]

Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing

    P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre- train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” ACM Computing Surveys , vol. 55, no. 9, pp. 1–35, 2023

  10. [10]

    A Survey on In-context Learning

    Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, and Z. Sui, “A survey for in-context learning,” arXiv preprint arXiv:2301.00234, 2022

  11. [11]

Towards Reasoning in Large Language Models: A Survey

    J. Huang and K. C.-C. Chang, “Towards reasoning in large language models: A survey,” arXiv preprint arXiv:2212.10403 , 2022

  12. [12]

    An empirical study of smoothing techniques for language modeling,

    S. F. Chen and J. Goodman, “An empirical study of smoothing techniques for language modeling,” Computer Speech & Language , vol. 13, no. 4, pp. 359–394, 1999

  13. [13]

    A neural probabilistic language model,

    Y . Bengio, R. Ducharme, and P. Vincent, “A neural probabilistic language model,” Advances in neural information processing systems , vol. 13, 2000

  14. [14]

    Continuous space language models for statistical machine translation,

    H. Schwenk, D. D ´echelotte, and J.-L. Gauvain, “Continuous space language models for statistical machine translation,” in Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions , 2006, pp. 723–730

  15. [15]

    Recurrent neural network based language model

    T. Mikolov, M. Karafi ´at, L. Burget, J. Cernock `y, and S. Khudanpur, “Recurrent neural network based language model.” in Interspeech, vol. 2, no. 3. Makuhari, 2010, pp. 1045–1048

  16. [16]

Generating Sequences with Recurrent Neural Networks

    A. Graves, “Generating sequences with recurrent neural networks,” arXiv preprint arXiv:1308.0850 , 2013

  17. [17]

    Learning deep structured semantic models for web search using clickthrough data,

    P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck, “Learning deep structured semantic models for web search using clickthrough data,” in Proceedings of the 22nd ACM international conference on Information & Knowledge Management , 2013, pp. 2333–2338

  18. [18]

    J. Gao, C. Xiong, P. Bennett, and N. Craswell, Neural Approaches to Conversational Information Retrieval. Springer Nature, 2023, vol. 44

  19. [19]

    Sequence to sequence learning with neural networks,

    I. Sutskever, O. Vinyals, and Q. V . Le, “Sequence to sequence learning with neural networks,” Advances in neural information processing systems, vol. 27, 2014

  20. [20]

On the properties of neural machine translation: Encoder-decoder approaches

    K. Cho, B. Van Merri ¨enboer, D. Bahdanau, and Y . Bengio, “On the properties of neural machine translation: Encoder-decoder ap- proaches,” arXiv preprint arXiv:1409.1259 , 2014

  21. [21]

    From captions to visual concepts and back,

    H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Doll ´ar, J. Gao, X. He, M. Mitchell, J. C. Platt et al. , “From captions to visual concepts and back,” in Proceedings of the IEEE conference on computer vision and pattern recognition , 2015, pp. 1473–1482

  22. [22]

    Show and tell: A neural image caption generator,

    O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proceedings of the IEEE conference on computer vision and pattern recognition , 2015, pp. 3156–3164

  23. [23]

    Deep contextualized word representations

    M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations. corr abs/1802.05365 (2018),” arXiv preprint arXiv:1802.05365 , 2018

  24. [24]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018

  25. [25]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Y . Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V . Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692 , 2019

  26. [26]

    DeBERTa: Decoding-enhanced BERT with Disentangled Attention

    P. He, X. Liu, J. Gao, and W. Chen, “Deberta: Decoding-enhanced bert with disentangled attention,” arXiv preprint arXiv:2006.03654 , 2020

  27. [27]

    Pre-trained models: Past, present and future,

    X. Han, Z. Zhang, N. Ding, Y . Gu, X. Liu, Y . Huo, J. Qiu, Y . Yao, A. Zhang, L. Zhang et al. , “Pre-trained models: Past, present and future,” AI Open, vol. 2, pp. 225–250, 2021

  28. [28]

    Pre-trained models for natural language processing: A survey,

    X. Qiu, T. Sun, Y . Xu, Y . Shao, N. Dai, and X. Huang, “Pre-trained models for natural language processing: A survey,” Science China Technological Sciences, vol. 63, no. 10, pp. 1872–1897, 2020

  29. [29]

    Efficiently modeling long sequences with structured state spaces,

    A. Gu, K. Goel, and C. R ´e, “Efficiently modeling long sequences with structured state spaces,” 2022

  30. [30]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752 , 2023

  31. [31]

    PaLM: Scaling Language Modeling with Pathways

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,” arXiv preprint arXiv:2204.02311, 2022

  32. [32]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi`ere, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

  33. [33]

    GPT-4 Technical Report,

OpenAI, “GPT-4 Technical Report,” https://arxiv.org/pdf/2303.08774v3.pdf, 2023

  34. [34]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, b. ichter, F. Xia, E. Chi, Q. V . Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in Neural Information Processing Systems , S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 24 824–24 837. [Onl...

  35. [35]

Augmented language models: a survey

    G. Mialon, R. Dess `ı, M. Lomeli, C. Nalmpantis, R. Pasunuru, R. Raileanu, B. Rozi `ere, T. Schick, J. Dwivedi-Yu, A. Celikyil- maz et al. , “Augmented language models: a survey,” arXiv preprint arXiv:2302.07842, 2023

  36. [36]

    Check your facts and try again: Improving large language models with external knowledge and automated feedback

    B. Peng, M. Galley, P. He, H. Cheng, Y . Xie, Y . Hu, Q. Huang, L. Liden, Z. Yu, W. Chen, and J. Gao, “Check your facts and try again: Improving large language models with external knowledge and automated feedback,” arXiv preprint arXiv:2302.12813 , 2023

  37. [37]

    ReAct: Synergizing Reasoning and Acting in Language Models

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “React: Synergizing reasoning and acting in language models,” arXiv preprint arXiv:2210.03629, 2022

  38. [38]

    Learning internal representations by error propagation,

    D. E. Rumelhart, G. E. Hinton, R. J. Williams et al., “Learning internal representations by error propagation,” 1985

  39. [39]

    Finding structure in time,

    J. L. Elman, “Finding structure in time,” Cognitive science , vol. 14, no. 2, pp. 179–211, 1990

  40. [40]

    Fast text compression with neural networks

    M. V . Mahoney, “Fast text compression with neural networks.” in FLAIRS conference, 2000, pp. 230–234

  41. [41]

Strategies for training large scale neural network language models

    T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. ˇCernock`y, “Strate- gies for training large scale neural network language models,” in 2011 IEEE Workshop on Automatic Speech Recognition & Understanding . IEEE, 2011, pp. 196–201

  42. [42]

tmikolov. rnnlm. [Online]. Available: https://www.fit.vutbr.cz/~imikolov/rnnlm/

  43. [43]

    Deep learning–based text classification: a comprehensive review,

    S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, M. Chenaghlu, and J. Gao, “Deep learning–based text classification: a comprehensive review,” ACM computing surveys (CSUR) , vol. 54, no. 3, pp. 1–40, 2021

  44. [44]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems , vol. 30, 2017

  45. [45]

    ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

    Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language represen- tations,” arXiv preprint arXiv:1909.11942 , 2019

  46. [46]

    ELECTRA: Pre-training text encoders as discriminators rather than generators

    K. Clark, M.-T. Luong, Q. V . Le, and C. D. Manning, “Electra: Pre- training text encoders as discriminators rather than generators,” arXiv preprint arXiv:2003.10555, 2020

  47. [47]

Cross-lingual language model pretraining

    G. Lample and A. Conneau, “Cross-lingual language model pretrain- ing,” arXiv preprint arXiv:1901.07291 , 2019

  48. [48]

    Xlnet: Generalized autoregressive pretraining for language understanding,

    Z. Yang, Z. Dai, Y . Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V . Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” Advances in neural information processing systems , vol. 32, 2019

  49. [49]

    Unified language model pre-training for natural language understanding and generation,

    L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y . Wang, J. Gao, M. Zhou, and H.-W. Hon, “Unified language model pre-training for natural language understanding and generation,” Advances in neural information processing systems , vol. 32, 2019

  50. [50]

Improving language understanding by generative pre-training

    A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improv- ing language understanding by generative pre-training,” 2018

  51. [51]

    Language models are unsupervised multitask learners,

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019

  52. [52]

    Exploring the limits of transfer learning with a unified text-to-text transformer,

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020

  53. [53]

    mt5: A massively multilingual pre-trained text-to-text transformer

    L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel, “mt5: A massively multilingual pre-trained text-to-text transformer,” arXiv preprint arXiv:2010.11934 , 2020

  54. [54]

MASS: Masked sequence to sequence pre-training for language generation

    K. Song, X. Tan, T. Qin, J. Lu, and T.-Y . Liu, “Mass: Masked sequence to sequence pre-training for language generation,” arXiv preprint arXiv:1905.02450, 2019

  55. [55]

    BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

    M. Lewis, Y . Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V . Stoyanov, and L. Zettlemoyer, “Bart: Denoising sequence-to- sequence pre-training for natural language generation, translation, and comprehension,” arXiv preprint arXiv:1910.13461 , 2019

  56. [56]

Language models are few-shot learners

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language mod- els are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

  57. [57]

    Evaluating Large Language Models Trained on Code

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Ka- plan, H. Edwards, Y . Burda, N. Joseph, G. Brockman et al. , “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021

  58. [58]

    WebGPT: Browser-assisted question-answering with human feedback

    R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V . Kosaraju, W. Saunders et al., “Webgpt: Browser- assisted question-answering with human feedback,” arXiv preprint arXiv:2112.09332, 2021

  59. [59]

    Training language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al. , “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems , vol. 35, pp. 27 730–27 744, 2022

  60. [60]

Introducing ChatGPT (2022)

    OpenAI. (2022) Introducing chatgpt. [Online]. Available: https: //openai.com/blog/chatgpt

  61. [61]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al. , “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023

  62. [62]

Alpaca: A strong, replicable instruction-following model

    R. Taori, I. Gulrajani, T. Zhang, Y . Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Alpaca: A strong, replicable instruction- following model,” Stanford Center for Research on Foundation Mod- els. https://crfm. stanford. edu/2023/03/13/alpaca. html , vol. 3, no. 6, p. 7, 2023

  63. [63]

    QLoRA: Efficient Finetuning of Quantized LLMs

    T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Ef- ficient finetuning of quantized llms,”arXiv preprint arXiv:2305.14314, 2023

  64. [64]

    Koala: A dialogue model for academic research,

    X. Geng, A. Gudibande, H. Liu, E. Wallace, P. Abbeel, S. Levine, and D. Song, “Koala: A dialogue model for academic research,” Blog post, April, vol. 1, 2023

  65. [65]

    Mistral 7B

    A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al., “Mistral 7b,” arXiv preprint arXiv:2310.06825 , 2023

  66. [66]

    Code Llama: Open Foundation Models for Code

    B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y . Adi, J. Liu, T. Remez, J. Rapin et al., “Code llama: Open foundation models for code,” arXiv preprint arXiv:2308.12950 , 2023

  67. [67]

    Gorilla: Large language model connected with massive apis,

    S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: Large language model connected with massive apis,” 2023

  68. [68]

    Giraffe: Adventures in expanding context lengths in llms,

    A. Pal, D. Karkhanis, M. Roberts, S. Dooley, A. Sundararajan, and S. Naidu, “Giraffe: Adventures in expanding context lengths in llms,” arXiv preprint arXiv:2308.10882 , 2023

  69. [69]

    Vigogne: French instruction-following and chat models,

    B. Huang, “Vigogne: French instruction-following and chat models,” https://github.com/bofenghuang/vigogne, 2023

  70. [70]

How far can camels go? Exploring the state of instruction tuning on open resources

    Y . Wang, H. Ivison, P. Dasigi, J. Hessel, T. Khot, K. R. Chandu, D. Wadden, K. MacMillan, N. A. Smith, I. Beltagy et al., “How far can camels go? exploring the state of instruction tuning on open resources,” arXiv preprint arXiv:2306.04751 , 2023

  71. [71]

    Focused transformer: Contrastive training for context scaling,

    S. Tworkowski, K. Staniszewski, M. Pacek, Y . Wu, H. Michalewski, and P. Miło´s, “Focused transformer: Contrastive training for context scaling,” arXiv preprint arXiv:2307.03170 , 2023

  72. [72]

    Stable beluga models

D. Mahan, R. Carlow, L. Castricato, N. Cooper, and C. Laforte, “Stable beluga models.” [Online]. Available: https://huggingface.co/stabilityai/StableBeluga2

  73. [73]

Transcending scaling laws with 0.1% extra compute

    Y . Tay, J. Wei, H. W. Chung, V . Q. Tran, D. R. So, S. Shakeri, X. Gar- cia, H. S. Zheng, J. Rao, A. Chowdhery et al., “Transcending scaling laws with 0.1% extra compute,” arXiv preprint arXiv:2210.11399 , 2022

  74. [74]

    Scaling Instruction-Finetuned Language Models

    H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y . Tay, W. Fedus, Y . Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction- finetuned language models,” arXiv preprint arXiv:2210.11416 , 2022

  75. [75]

    PaLM 2 Technical Report

    R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al. , “Palm 2 technical report,” arXiv preprint arXiv:2305.10403 , 2023

  76. [76]

Large language models encode clinical knowledge

    K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl et al., “Large language models encode clinical knowledge,” arXiv preprint arXiv:2212.13138, 2022

  77. [77]

    Towards expert- level medical question answering with large language models,

    K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, L. Hou, K. Clark, S. Pfohl, H. Cole-Lewis, D. Neal et al. , “Towards expert- level medical question answering with large language models,” arXiv preprint arXiv:2305.09617, 2023

  78. [78]

    Finetuned Language Models Are Zero-Shot Learners

    J. Wei, M. Bosma, V . Y . Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V . Le, “Finetuned language models are zero-shot learners,” arXiv preprint arXiv:2109.01652 , 2021

  79. [79]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Younget al., “Scaling language models: Methods, analysis & insights from training gopher,” arXiv preprint arXiv:2112.11446, 2021

  80. [80]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    V . Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja et al. , “Multi- task prompted training enables zero-shot task generalization,” arXiv preprint arXiv:2110.08207, 2021

Showing first 80 references.