Recognition: 2 theorem links · Lean Theorem
Large Language Models: A Survey
Pith reviewed 2026-05-11 15:17 UTC · model grok-4.3
The pith
Large language models acquire general-purpose language understanding and generation by training billions of parameters on massive text data, as predicted by scaling laws.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs acquire their general-purpose language understanding and generation abilities by training billions of parameters on massive amounts of text data, as predicted by scaling laws. The paper surveys prominent models from the GPT, LLaMA, and PaLM families, discusses techniques for constructing and augmenting LLMs, reviews training and evaluation datasets along with common metrics, compares performance on benchmarks, and identifies open challenges.
What carries the argument
Scaling laws relating model performance to parameter count and training data volume, which the paper uses to frame the review of LLM families and their development.
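To make the scaling-law framing concrete, the sketch below evaluates the parametric loss form L(N, D) = E + A/N^alpha + B/D^beta fitted by Hoffmann et al. (2022), reference [2] below. The constants are the approximate values reported there, and the function name is ours, so treat this as an illustration rather than the survey's own method.

```python
# Minimal sketch of the compute-optimal scaling law L(N, D) = E + A/N**alpha + B/D**beta
# from Hoffmann et al. (2022) [2]. Constants are the approximate fitted values reported
# in that paper; they are illustrative, not taken from the survey under review.

E, A, B = 1.69, 406.4, 410.7      # irreducible loss and fitted coefficients
ALPHA, BETA = 0.34, 0.28          # exponents for parameter count N and token count D

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pre-training loss for a model with n_params parameters
    trained on n_tokens tokens, under the fitted form above."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

if __name__ == "__main__":
    # Roughly Chinchilla's budget (70B params, 1.4T tokens) versus
    # roughly GPT-3's budget (175B params, 300B tokens).
    print(predicted_loss(70e9, 1.4e12))
    print(predicted_loss(175e9, 300e9))
```

Under these rough constants the smaller, longer-trained model is predicted to reach lower loss, which is the kind of prediction the survey's scaling-law framing rests on.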
If this is right
- Techniques for building and augmenting LLMs can be applied to improve performance on specific downstream tasks.
- Benchmark comparisons highlight which model families excel in particular areas of language processing.
- Discussion of limitations points to concrete areas where future model development should focus.
- Overview of datasets and metrics provides a basis for consistent evaluation across new models.
Where Pith is reading between the lines
- The rapid changes in the field may require periodic updates to the survey to maintain relevance for practitioners.
- Insights on model limitations could guide efforts to create more efficient versions that use fewer resources while retaining capabilities.
- Connections between scaling and emergent abilities suggest testing whether further increases in size produce qualitatively new behaviors beyond current benchmarks.
Load-bearing premise
The selection of prominent LLMs and representative benchmarks accurately reflects the field's current state without major omissions or bias.
What would settle it
A new model family or benchmark set that was omitted from the survey but shows substantially different performance patterns or violates the scaling predictions on the same tasks.
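One way to operationalize "violates the scaling predictions" is to compare a newly reported (parameters, tokens, loss) point against the fitted curve. The sketch below does this with the same approximate Hoffmann et al. (2022) constants; the 20% tolerance and the example numbers are arbitrary illustrations, not thresholds from the survey.

```python
# Sketch of a falsification check: flag a reported model whose loss deviates from the
# scaling-law prediction by more than a chosen relative tolerance. Constants are the
# approximate Hoffmann et al. (2022) fits; tolerance and example values are made up.

E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def deviates_from_scaling(n_params: float, n_tokens: float,
                          observed_loss: float, tol: float = 0.20) -> bool:
    """True if the observed loss differs from the predicted loss by more than tol."""
    predicted = E + A / n_params**ALPHA + B / n_tokens**BETA
    return abs(observed_loss - predicted) / predicted > tol

# Hypothetical report: a 30B-parameter model trained on 600B tokens with loss 1.5.
print(deviates_from_scaling(30e9, 600e9, observed_loss=1.5))
```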
Original abstract
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks, since the release of ChatGPT in November 2022. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive amounts of text data, as predicted by scaling laws [1], [2]. The research area of LLMs, while very recent, is evolving rapidly in many different ways. In this paper, we review some of the most prominent LLMs, including three popular LLM families (GPT, LLaMA, PaLM), and discuss their characteristics, contributions and limitations. We also give an overview of techniques developed to build, and augment LLMs. We then survey popular datasets prepared for LLM training, fine-tuning, and evaluation, review widely used LLM evaluation metrics, and compare the performance of several popular LLMs on a set of representative benchmarks. Finally, we conclude the paper by discussing open challenges and future research directions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript surveys large language models (LLMs), noting that their general-purpose language understanding and generation capabilities arise from training billions of parameters on massive text corpora in line with scaling laws. It reviews prominent LLM families (GPT, LLaMA, PaLM), their characteristics, contributions, and limitations; overviews techniques for building and augmenting LLMs; surveys datasets for training, fine-tuning, and evaluation; reviews evaluation metrics; compares several LLMs on representative benchmarks; and discusses open challenges and future directions.
Significance. If the summaries remain faithful to the cited sources, the survey provides a structured entry point into the post-ChatGPT LLM literature. Its value lies in consolidating model families, techniques, datasets, metrics, and benchmark results into one document, which can help researchers track the field's rapid evolution without needing to consult dozens of primary papers. The explicit linkage to scaling laws and the inclusion of performance comparisons add practical utility for both newcomers and specialists.
major comments (1)
- [Abstract and Introduction] Abstract and §1 (Introduction): the selection of 'some of the most prominent LLMs' and the specific families (GPT, LLaMA, PaLM) plus benchmarks is presented without explicit inclusion/exclusion criteria or a justification of coverage breadth. This choice directly affects the survey's representativeness and risks author-specific bias, which is load-bearing for a descriptive review whose central contribution is organizational completeness.
minor comments (3)
- [References] Ensure that all cited works (e.g., Kaplan et al. 2020, Hoffmann et al. 2022) are listed in the bibliography with complete and consistent formatting, including arXiv identifiers or DOIs where applicable.
- [Evaluation section] Benchmark comparison tables would benefit from an explicit statement of the evaluation date or model versions used, given the rapid release cadence of new LLMs.
- [Figures and Tables] Figure captions and table legends should be expanded to be self-contained, specifying what each column/row represents without requiring reference to the main text.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and the recommendation for minor revision. The feedback on selection criteria is constructive and we address it directly below.
Point-by-point responses
-
Referee: [Abstract and Introduction] Abstract and §1 (Introduction): the selection of 'some of the most prominent LLMs' and the specific families (GPT, LLaMA, PaLM) plus benchmarks is presented without explicit inclusion/exclusion criteria or a justification of coverage breadth. This choice directly affects the survey's representativeness and risks author-specific bias, which is load-bearing for a descriptive review whose central contribution is organizational completeness.
Authors: We agree that the absence of explicit inclusion/exclusion criteria reduces transparency. In the revised version we will insert a new paragraph at the end of §1 that states our selection rationale: we focus on three families that (i) exemplify distinct development paradigms (closed-source scaling in GPT, open-source accessibility in LLaMA, and efficient large-scale training in PaLM), (ii) have been cited extensively in the post-ChatGPT literature, and (iii) together cover the dominant architectural and training choices discussed in the survey. Benchmarks were chosen as those most frequently reported across the cited primary papers for core capabilities (reasoning, knowledge, instruction following). We explicitly note that the survey is not exhaustive and that many other models exist; the chosen set is intended to illustrate representative trends rather than to claim completeness. This addition directly mitigates the risk of perceived author-specific bias while preserving the survey's scope.
Revision: yes
Circularity Check
No significant circularity: survey of external literature only
full rationale
This paper is explicitly a survey that organizes and summarizes existing external work on LLMs (GPT, LLaMA, PaLM families), techniques, datasets, metrics, and benchmarks. Its central statements cite scaling laws to Kaplan et al. (2020) and Hoffmann et al. (2022) with no self-citation load-bearing on any claim. No equations, new predictions, fitted parameters, or derivations appear; the text frames all content as review rather than novel technical assertion. No step reduces by construction to the paper's own inputs.
Axiom & Free-Parameter Ledger
No entries: as the circularity rationale above notes, the paper introduces no equations, fitted parameters, or derivations of its own.
Forward citations
Cited by 35 Pith papers
- Variance-aware Reward Modeling with Anchor Guidance
  Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, ...
- Logic-Regularized Verifier Elicits Reasoning from LLMs
  LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
- PrivacyAssist: A User-Centric Agent Framework for Detecting Privacy Inconsistencies in Android Apps
  PrivacyAssist uses multi-agent LLMs and RAG to detect mismatches between Android app permissions and declared data practices, finding only 16% of 2,347 apps fully consistent.
- Masked-Token Prediction for Anomaly Detection at the Large Hadron Collider
  The work demonstrates masked-token prediction with transformers for model-independent anomaly detection in LHC data, achieving strong results on top-rich BSM signatures like four-top production using VQ-VAE tokenization.
- SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation
  SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.
- Cross-Modal Bayesian Low-Rank Adaptation for Uncertainty-Aware Multimodal Learning
  CALIBER conditions the variational posterior of low-rank adapters on token-level cross-attention between text and audio to produce uncertainty-aware multimodal parameter-efficient fine-tuning.
- NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions
  NL2SQLBench is a new modular benchmarking framework that evaluates LLM NL2SQL methods across three core modules on existing datasets, exposing large accuracy gaps and computational inefficiency.
- Unified Compression Algorithm for Distributed Nonconvex Optimization: Generalized to 1-Bit, Saturation, and Bounded Noise
  A unified compression algorithm for distributed nonconvex optimization achieves O(1/sqrt(T)) convergence for locally-bounded compressors, matching centralized 1-bit methods, with an improved O(1/T^{2/3}) rate after on...
- Mechanism Design for Quality-Preserving LLM Advertising
  A quality-preserving auction framework for LLM advertising uses RAG-based endogenous reserves and KL-regularized or screened VCG mechanisms to achieve DSIC, IR, higher revenue, and better semantic fidelity than baselines.
- Continuous Latent Diffusion Language Model
  Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...
- Decision-aware User Simulation Agent for Evaluating Conversational Recommender Systems
  Hesitator is a theory-grounded simulator that separates utility-based item selection from overload-aware commitment decisions to reduce unrealistic high acceptance rates in conversational recommender evaluations.
- FitText: Evolving Agent Tool Ecologies via Memetic Retrieval
  FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.
- OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens
  OmniMouse demonstrates data-driven scaling in multi-task brain models on a 150B-token neural dataset, achieving SOTA across prediction, decoding, and forecasting while model size gains saturate.
- Beyond Feature Fusion: Contextual Bayesian PEFT for Multimodal Uncertainty Estimation
  CoCo-LoRA uses audio context to modulate uncertainty in Bayesian low-rank adapters for multimodal text tasks, offering a lightweight alternative to feature fusion that matches or exceeds baselines.
- ReRec: Reasoning-Augmented LLM-based Recommendation Assistant via Reinforcement Fine-tuning
  ReRec uses reinforcement fine-tuning with dual-graph reward shaping, reasoning-aware advantage estimation, and online curriculum scheduling to improve LLM reasoning and performance in recommendation tasks.
- PRISM-MCTS: Learning from Reasoning Trajectories with Metacognitive Reflection
  PRISM-MCTS improves MCTS-based reasoning efficiency by maintaining a shared memory of heuristics and fallacies reinforced by a process reward model, halving required trajectories on GPQA while outperforming prior methods.
- SAM 3D: 3Dfy Anything in Images
  SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose using human-model annotated data at scale and synthetic-to-real training, achieving 5:1 human preference wins.
- Context Convergence Improves Answering Inferential Questions
  Passages made from high-convergence sentences improve LLM performance on inferential questions compared to cosine similarity selection.
- Annotation Quality in Aspect-Based Sentiment Analysis: A Case Study Comparing Experts, Students, Crowdworkers, and Large Language Model
  Expert re-annotations of a German ABSA dataset serve as ground truth to evaluate how students, crowdworkers, and LLMs affect inter-annotator agreement and downstream performance on ACSA and TASD tasks using BERT, T5, ...
- Revisiting General Map Search via Generative Point-of-Interest Retrieval
  GenPOI is a generative POI retrieval system that unifies heterogeneous contexts via LLMs, uses geo-semantic tokenization, and applies proximity constraints to achieve superior performance on large-scale map search data.
- Weight Pruning Amplifies Bias: A Multi-Method Study of Compressed LLMs for Edge AI
  Activation-aware pruning preserves perplexity but amplifies bias in LLMs, with 47-59% of previously neutral items developing new stereotypical responses at 70% sparsity.
- A Survey on Split Learning for LLM Fine-Tuning: Models, Systems, and Privacy Optimizations
  A survey that introduces a unified training pipeline and taxonomizes split learning approaches for LLM fine-tuning across model, system, and privacy dimensions.
- Automated LTL Specification Generation from Industrial Aerospace Requirements
  AeroReq2LTL automates LTL generation from industrial aerospace requirements via LLMs with a data dictionary and templates, achieving 85% precision and 88% recall on real data.
- Calibrating Model-Based Evaluation Metrics for Summarization
  A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
- On the Representational Limits of Quantum-Inspired 1024-D Document Embeddings: An Experimental Evaluation Framework
  Quantum-inspired 1024-D document embeddings exhibit weak, unstable ranking performance and structural geometric limitations, performing better as auxiliary components in hybrid lexical-embedding retrieval systems.
- The Role of Emotional Stimuli and Intensity in Shaping Large Language Model Behavior
  Positive emotional prompts improve LLM accuracy and reduce toxicity but increase sycophantic agreement, while negative emotions show the reverse pattern.
- Addressing Data Scarcity in Bangla Fake News Detection: An LLM-Based Dataset Augmentation Approach
  LLM-based augmentation of the minority class in a Bangla fake news dataset, using high rates and random subsampling, improves F1 score from 0.85 to 0.88.
- LLM-Enhanced Topical Trend Detection at Snapchat
  Snapchat's deployed system detects emerging topical trends in short videos via multimodal extraction, time-series burst detection, and LLM consolidation, achieving high precision per six months of human evaluation and...
- An End-to-End Ukrainian RAG for Local Deployment. Optimized Hybrid Search and Lightweight Generation
  A two-stage hybrid search pipeline paired with a synthetic-data fine-tuned and compressed Ukrainian language model delivers competitive local question answering under strict compute limits.
- Enhancing Mental Health Counseling Support in Bangladesh using Culturally-Grounded Knowledge
  A clinically validated knowledge graph built for Bangladeshi stressors and interventions improves LLM counseling responses over standard RAG in contextual relevance and clinical appropriateness.
- Network Effects and Agreement Drift in LLM Debates
  LLM agents in controlled network debates show agreement drift toward specific opinion positions, requiring separation of structural effects from LLM biases before using them as human behavioral proxies.
- Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models
  This survey organizes aerial vision-language navigation methods into five architectural categories, critically reviews evaluation infrastructure, and synthesizes seven open problems for LLM/VLM integration.
- Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning
  EduQwen 32B models optimized via RL then SFT set new SOTA on the Cross-Domain Pedagogical Knowledge Benchmark and surpass Gemini-3 Pro.
- Materials Informatics Across the Length Scales
  A survey of data-driven methods for materials modeling at nanoscale, mesoscale, and micro-to-continuum scales that identifies established capabilities, data quality issues, and obstacles to cross-scale integration.
- Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems
  A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.
Reference graph
Works this paper leans on
[1] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, "Scaling laws for neural language models," arXiv preprint arXiv:2001.08361, 2020.
[2] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. l. Casas, L. A. Hendricks, J. Welbl, A. Clark et al., "Training compute-optimal large language models," arXiv preprint arXiv:2203.15556, 2022.
[3] C. E. Shannon, "Prediction and entropy of printed English," Bell System Technical Journal, vol. 30, no. 1, pp. 50–64, 1951.
[4] F. Jelinek, Statistical Methods for Speech Recognition. MIT Press, 1998.
[5] C. Manning and H. Schutze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[6] C. D. Manning, An Introduction to Information Retrieval. Cambridge University Press, 2009.
[7] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., "A survey of large language models," arXiv preprint arXiv:2303.18223, 2023.
[8] C. Zhou, Q. Li, C. Li, J. Yu, Y. Liu, G. Wang, K. Zhang, C. Ji, Q. Yan, L. He et al., "A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT," arXiv preprint arXiv:2302.09419, 2023.
[9] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, "Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing," ACM Computing Surveys, vol. 55, no. 9, pp. 1–35, 2023.
[10] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, and Z. Sui, "A survey for in-context learning," arXiv preprint arXiv:2301.00234, 2022.
[11] J. Huang and K. C.-C. Chang, "Towards reasoning in large language models: A survey," arXiv preprint arXiv:2212.10403, 2022.
[12] S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," Computer Speech & Language, vol. 13, no. 4, pp. 359–394, 1999.
[13] Y. Bengio, R. Ducharme, and P. Vincent, "A neural probabilistic language model," Advances in Neural Information Processing Systems, vol. 13, 2000.
[14] H. Schwenk, D. Déchelotte, and J.-L. Gauvain, "Continuous space language models for statistical machine translation," in Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, 2006, pp. 723–730.
[15] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur, "Recurrent neural network based language model," in Interspeech, vol. 2, no. 3. Makuhari, 2010, pp. 1045–1048.
[16] A. Graves, "Generating sequences with recurrent neural networks," arXiv preprint arXiv:1308.0850, 2013.
[17] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck, "Learning deep structured semantic models for web search using clickthrough data," in Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 2013, pp. 2333–2338.
[18] J. Gao, C. Xiong, P. Bennett, and N. Craswell, Neural Approaches to Conversational Information Retrieval. Springer Nature, 2023, vol. 44.
[19] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," Advances in Neural Information Processing Systems, vol. 27, 2014.
[20] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," arXiv preprint arXiv:1409.1259, 2014.
[21] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt et al., "From captions to visual concepts and back," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1473–1482.
[22] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.
[23] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, "Deep contextualized word representations," arXiv preprint arXiv:1802.05365, 2018.
[24] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[25] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[26] P. He, X. Liu, J. Gao, and W. Chen, "DeBERTa: Decoding-enhanced BERT with disentangled attention," arXiv preprint arXiv:2006.03654, 2020.
[27] X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, J. Qiu, Y. Yao, A. Zhang, L. Zhang et al., "Pre-trained models: Past, present and future," AI Open, vol. 2, pp. 225–250, 2021.
[28] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, "Pre-trained models for natural language processing: A survey," Science China Technological Sciences, vol. 63, no. 10, pp. 1872–1897, 2020.
[29] A. Gu, K. Goel, and C. Ré, "Efficiently modeling long sequences with structured state spaces," 2022.
[30] A. Gu and T. Dao, "Mamba: Linear-time sequence modeling with selective state spaces," arXiv preprint arXiv:2312.00752, 2023.
[31] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., "PaLM: Scaling language modeling with pathways," arXiv preprint arXiv:2204.02311, 2022.
[32] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., "LLaMA: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.
[33] OpenAI, "GPT-4 technical report," https://arxiv.org/pdf/2303.08774v3.pdf, 2023.
[34] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou, "Chain-of-thought prompting elicits reasoning in large language models," in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 24824–24837.
[35] G. Mialon, R. Dessì, M. Lomeli, C. Nalmpantis, R. Pasunuru, R. Raileanu, B. Rozière, T. Schick, J. Dwivedi-Yu, A. Celikyilmaz et al., "Augmented language models: a survey," arXiv preprint arXiv:2302.07842, 2023.
[36] B. Peng, M. Galley, P. He, H. Cheng, Y. Xie, Y. Hu, Q. Huang, L. Liden, Z. Yu, W. Chen, and J. Gao, "Check your facts and try again: Improving large language models with external knowledge and automated feedback," arXiv preprint arXiv:2302.12813, 2023.
[37] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, "ReAct: Synergizing reasoning and acting in language models," arXiv preprint arXiv:2210.03629, 2022.
[38] D. E. Rumelhart, G. E. Hinton, R. J. Williams et al., "Learning internal representations by error propagation," 1985.
[39] J. L. Elman, "Finding structure in time," Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.
[40] M. V. Mahoney, "Fast text compression with neural networks," in FLAIRS Conference, 2000, pp. 230–234.
[41] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. Černocký, "Strategies for training large scale neural network language models," in 2011 IEEE Workshop on Automatic Speech Recognition & Understanding. IEEE, 2011, pp. 196–201.
[42] T. Mikolov, "RNNLM." [Online]. Available: https://www.fit.vutbr.cz/~imikolov/rnnlm/
[43] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, M. Chenaghlu, and J. Gao, "Deep learning–based text classification: a comprehensive review," ACM Computing Surveys (CSUR), vol. 54, no. 3, pp. 1–40, 2021.
[44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[45] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: A lite BERT for self-supervised learning of language representations," arXiv preprint arXiv:1909.11942, 2019.
[46] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, "ELECTRA: Pre-training text encoders as discriminators rather than generators," arXiv preprint arXiv:2003.10555, 2020.
[47] G. Lample and A. Conneau, "Cross-lingual language model pretraining," arXiv preprint arXiv:1901.07291, 2019.
[48] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized autoregressive pretraining for language understanding," Advances in Neural Information Processing Systems, vol. 32, 2019.
[49] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H.-W. Hon, "Unified language model pre-training for natural language understanding and generation," Advances in Neural Information Processing Systems, vol. 32, 2019.
[50] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., "Improving language understanding by generative pre-training," 2018.
[51] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., "Language models are unsupervised multitask learners," OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[52] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
[53] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel, "mT5: A massively multilingual pre-trained text-to-text transformer," arXiv preprint arXiv:2010.11934, 2020.
[54] K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu, "MASS: Masked sequence to sequence pre-training for language generation," arXiv preprint arXiv:1905.02450, 2019.
[55] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension," arXiv preprint arXiv:1910.13461, 2019.
[56] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
[57] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021.
[58] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders et al., "WebGPT: Browser-assisted question-answering with human feedback," arXiv preprint arXiv:2112.09332, 2021.
[59] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., "Training language models to follow instructions with human feedback," Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
[60] OpenAI, "Introducing ChatGPT," 2022. [Online]. Available: https://openai.com/blog/chatgpt
[61] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023.
[62] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, "Alpaca: A strong, replicable instruction-following model," Stanford Center for Research on Foundation Models, https://crfm.stanford.edu/2023/03/13/alpaca.html, vol. 3, no. 6, p. 7, 2023.
[63] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "QLoRA: Efficient finetuning of quantized LLMs," arXiv preprint arXiv:2305.14314, 2023.
[64] X. Geng, A. Gudibande, H. Liu, E. Wallace, P. Abbeel, S. Levine, and D. Song, "Koala: A dialogue model for academic research," Blog post, April, vol. 1, 2023.
[65] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al., "Mistral 7B," arXiv preprint arXiv:2310.06825, 2023.
[66] B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin et al., "Code Llama: Open foundation models for code," arXiv preprint arXiv:2308.12950, 2023.
[67] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, "Gorilla: Large language model connected with massive APIs," 2023.
[68] A. Pal, D. Karkhanis, M. Roberts, S. Dooley, A. Sundararajan, and S. Naidu, "Giraffe: Adventures in expanding context lengths in LLMs," arXiv preprint arXiv:2308.10882, 2023.
[69] B. Huang, "Vigogne: French instruction-following and chat models," https://github.com/bofenghuang/vigogne, 2023.
[70] Y. Wang, H. Ivison, P. Dasigi, J. Hessel, T. Khot, K. R. Chandu, D. Wadden, K. MacMillan, N. A. Smith, I. Beltagy et al., "How far can camels go? Exploring the state of instruction tuning on open resources," arXiv preprint arXiv:2306.04751, 2023.
[71] S. Tworkowski, K. Staniszewski, M. Pacek, Y. Wu, H. Michalewski, and P. Miłoś, "Focused transformer: Contrastive training for context scaling," arXiv preprint arXiv:2307.03170, 2023.
[72] D. Mahan, R. Carlow, L. Castricato, N. Cooper, and C. Laforte, "Stable Beluga models." [Online]. Available: https://huggingface.co/stabilityai/StableBeluga2
[73] Y. Tay, J. Wei, H. W. Chung, V. Q. Tran, D. R. So, S. Shakeri, X. Garcia, H. S. Zheng, J. Rao, A. Chowdhery et al., "Transcending scaling laws with 0.1% extra compute," arXiv preprint arXiv:2210.11399, 2022.
[74] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma et al., "Scaling instruction-finetuned language models," arXiv preprint arXiv:2210.11416, 2022.
[75] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al., "PaLM 2 technical report," arXiv preprint arXiv:2305.10403, 2023.
[76] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl et al., "Large language models encode clinical knowledge," arXiv preprint arXiv:2212.13138, 2022.
[77] K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, L. Hou, K. Clark, S. Pfohl, H. Cole-Lewis, D. Neal et al., "Towards expert-level medical question answering with large language models," arXiv preprint arXiv:2305.09617, 2023.
[78] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, "Finetuned language models are zero-shot learners," arXiv preprint arXiv:2109.01652, 2021.
[79] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young et al., "Scaling language models: Methods, analysis & insights from training Gopher," arXiv preprint arXiv:2112.11446, 2021.
[80] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja et al., "Multitask prompted training enables zero-shot task generalization," arXiv preprint arXiv:2110.08207, 2021.