arxiv: 2309.10305 · v4 · pith:PG4VPBKMnew · submitted 2023-09-19 · 💻 cs.CL

Baichuan 2: Open Large-scale Language Models

Pith reviewed 2026-05-24 06:49 UTC · model grok-4.3

classification 💻 cs.CL
keywords large language models · open source · multilingual models · model benchmarks · pre-training · vertical domains · 7B 13B parameters

The pith

Baichuan 2 presents 7B and 13B parameter models trained on 2.6 trillion tokens that match or exceed similar open-source models on public benchmarks and vertical domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Baichuan 2 as a pair of multilingual language models with 7 billion and 13 billion parameters, each trained from scratch on 2.6 trillion tokens. It reports that these models match or outperform other open-source models of comparable size on benchmarks such as MMLU, CMMLU, GSM8K, and HumanEval. The work further shows strong results in specialized fields including medicine and law. Releasing the full pre-training checkpoints is presented as a way for the community to inspect training dynamics directly.

Core claim

Baichuan 2 consists of 7B and 13B parameter multilingual language models trained from scratch on 2.6 trillion tokens. These models match or outperform other open-source models of similar size on public benchmarks including MMLU, CMMLU, GSM8K, and HumanEval, while also delivering strong performance in vertical domains such as medicine and law.

What carries the argument

The Baichuan 2 model series, defined by its 7B/13B parameter counts and 2.6 trillion token training volume, serves as the vehicle for demonstrating competitive open multilingual LLM performance.
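
To put the two headline figures side by side, the short sketch below works out how many training tokens each model saw per parameter. It assumes the nominal parameter counts implied by the model names (7 billion and 13 billion); the exact counts may differ, and the names `TOKENS`, `NOMINAL_PARAMS`, and `tokens_per_parameter` are illustrative only.

```python
# Training volume reported in the paper: 2.6 trillion tokens.
TOKENS = 2.6e12

# Assumption: nominal parameter counts taken from the model names.
NOMINAL_PARAMS = {"Baichuan 2-7B": 7e9, "Baichuan 2-13B": 13e9}


def tokens_per_parameter(tokens: float, params: float) -> float:
    """Return the number of training tokens seen per model parameter."""
    return tokens / params


ratios = {
    name: tokens_per_parameter(TOKENS, params)
    for name, params in NOMINAL_PARAMS.items()
}

for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.0f} tokens per parameter")
```

Under these assumptions the smaller model sees roughly 371 tokens per parameter and the larger one roughly 200, since both are trained on the same token budget.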

If this is right

  • Open-source models of this scale can reach parity with peers on standard language, math, and code tasks.
  • Strong results in medicine and law indicate that pre-training alone can support vertical-domain capability.
  • Releasing pre-training checkpoints enables direct study of how training dynamics produce the reported benchmark scores.
  • Multilingual training on 2.6 trillion tokens supports competitive performance across languages on the tested evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may encourage more researchers to build and compare models in languages other than English.
  • Vertical-domain strength could arise from the composition of the pre-training data rather than later fine-tuning steps.
  • Releasing checkpoints at this scale creates an opportunity to test whether similar training runs at other parameter counts yield proportional gains.

Load-bearing premise

The chosen public benchmarks and vertical-domain tests provide sufficient and unbiased measures of overall model quality.

What would settle it

A new benchmark set with no possible overlap with the training data, on which Baichuan 2 scores substantially below comparable open models, would challenge the performance claims.

Figures

Figures reproduced from arXiv: 2309.10305 by Aiyuan Yang, Bingning Wang, Bin Xiao, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Liu, Feng Wang, Guangwei Ai, Guosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji, Jian Xie, Juntao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma, Mang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun, Tao Zhang, Tianpeng Li, Tianyu Li, Wei Cheng, Weipeng Chen, Xiangrong Zeng, Xiaochuan Wang, Xiaoxi Chen, Xin Men, Xin Yu, Xuehai Pan, Yanjun Shen, Yiding Wang, Yiyu Li, Youxin Jiang, Yuchen Gao, Yupeng Zhang, Zenan Zhou, Zhiying Wu.

Figure 1: The distribution of different categories of ...
Figure 2: The data processing procedure of Baichuan 2’s pre-training data.
Figure 3: The pre-training loss of Baichuan 2. The final training loss of Baichuan 2-7B and Baichuan 2-13B are shown in ...
Figure 4: The scaling law of Baichuan 2. We trained ...
Figure 5: An illustration of Baichuan 2’s RLHF process.
Figure 6: Helpfulness and harmlessness before and after ...
Figure 7: The results of intermediary checkpoints of ...
Figure 8: The various training loss of small models for ...
Figure 9: The training loss with and without NormHead ...
Figure 10: Evaluation results of Baichuan 2-13B and Baichuan 2-7B on different pre-training steps.

Original abstract

Large language models (LLMs) have demonstrated remarkable performance on a variety of natural language tasks based on just a few examples of natural language instructions, reducing the need for extensive feature engineering. However, most powerful LLMs are closed-source or limited in their capability for languages other than English. In this technical report, we present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch, on 2.6 trillion tokens. Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval. Furthermore, Baichuan 2 excels in vertical domains such as medicine and law. We will release all pre-training model checkpoints to benefit the research community in better understanding the training dynamics of Baichuan 2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents Baichuan 2, a series of 7B and 13B parameter multilingual LLMs trained from scratch on 2.6 trillion tokens. It claims these models match or outperform other open-source models of similar size on public benchmarks including MMLU, CMMLU, GSM8K, and HumanEval, and excel in vertical domains such as medicine and law. The authors state they will release all pre-training model checkpoints.

Significance. If the performance claims hold after verification, the work would provide competitive open multilingual models with reported strengths in Chinese-language and domain-specific tasks, plus the release of checkpoints to support research on training dynamics. The open release itself is a concrete contribution to reproducibility.

major comments (2)
  1. [Abstract] Abstract: the central performance claims (matching/outperforming on MMLU, CMMLU, GSM8K, HumanEval) are presented without any description of decontamination procedures, benchmark overlap checks, or training-data composition analysis for the 2.6T-token corpus. This is load-bearing because public web-scale data is likely to contain benchmark instances, rendering the reported scores non-diagnostic of generalization without explicit evidence to the contrary.
  2. [Abstract] Abstract: the assertion that Baichuan 2 'excels in vertical domains such as medicine and law' is stated without naming the specific datasets, evaluation protocols, or comparison baselines used for those domains, preventing assessment of whether the claim is supported by the experiments.
minor comments (1)
  1. [Abstract] The abstract could usefully include a pointer to the specific tables or sections that report the benchmark numbers and vertical-domain results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, proposing targeted revisions to strengthen the manuscript while maintaining accuracy.

  1. Referee: [Abstract] Abstract: the central performance claims (matching/outperforming on MMLU, CMMLU, GSM8K, HumanEval) are presented without any description of decontamination procedures, benchmark overlap checks, or training-data composition analysis for the 2.6T-token corpus. This is load-bearing because public web-scale data is likely to contain benchmark instances, rendering the reported scores non-diagnostic of generalization without explicit evidence to the contrary.

    Authors: We agree that explicit discussion of decontamination procedures and benchmark overlap is necessary to support the generalization claims. Section 3 of the manuscript describes the composition and filtering of the 2.6T-token corpus, but a dedicated analysis of n-gram overlaps with the cited benchmarks was not included. In the revised manuscript we will add a new subsection in Section 3 that reports our decontamination steps, including overlap statistics with MMLU, CMMLU, GSM8K and HumanEval, and any mitigation measures taken. This addition will directly substantiate the abstract claims. revision: yes

  2. Referee: [Abstract] Abstract: the assertion that Baichuan 2 'excels in vertical domains such as medicine and law' is stated without naming the specific datasets, evaluation protocols, or comparison baselines used for those domains, preventing assessment of whether the claim is supported by the experiments.

    Authors: The domain-specific results are reported in Section 5, which names the datasets (medical and legal subsets of CMMLU plus additional benchmarks such as MedQA), describes the evaluation protocols, and provides baseline comparisons. To improve clarity we will revise the abstract to name the primary datasets and add a parenthetical reference to Section 5, allowing readers to locate the full protocols and baselines without altering the technical content. revision: yes
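
The n-gram overlap analysis mentioned in the first response can be sketched as follows. This is a minimal illustration of one common decontamination check, not the authors' procedure: the function names, the whitespace tokenization, and the default window of 8 tokens are all assumptions made for the example.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the corpus."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Toy usage with a small window so the example stays readable.
score = overlap_fraction("a b c d", ["x a b c y"], n=2)
print(f"overlap fraction: {score:.3f}")
```

A benchmark item with a high overlap fraction would be a candidate for removal or separate reporting. Whitespace tokenization is a simplification: Chinese-language benchmarks such as CMMLU would need character-level or tokenizer-based n-grams, and a corpus of 2.6T tokens would need a hashed or indexed lookup rather than an in-memory set.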

Circularity Check

0 steps flagged

Empirical training report with no derivation chain or predictions

The paper is a technical report on training Baichuan 2 (7B/13B) from scratch on 2.6T tokens and reporting results on external benchmarks (MMLU, CMMLU, GSM8K, HumanEval, plus vertical domains). No equations, first-principles derivations, fitted parameters renamed as predictions, or self-referential definitions appear anywhere in the manuscript. Central claims rest on observed benchmark numbers from public datasets, which are independent of the paper's internal content. No load-bearing self-citations or ansatzes are invoked to justify any result. This is a standard empirical report whose validity can be checked against external benchmarks; no circularity is present.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or new theoretical entities appear in the abstract. The work is an empirical model-training report whose claims rest on standard LLM scaling practices and benchmark usage, the details of which the abstract does not supply.

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Brain-LLM Alignment Tracks Training Data, Not Typology

    cs.CL 2026-05 unverdicted novelty 7.0

    Training-language dominance, not English inherent properties, determines brain-LLM alignment across English, Chinese, and French, with additional independent effects from typological distance concentrated in syntactic...

  2. VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

    cs.CL 2026-05 unverdicted novelty 7.0

    VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

  3. TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice

    cs.CL 2026-04 unverdicted novelty 7.0

    TaxPraBen is a new benchmark with 14 datasets and a structured evaluation method for measuring LLM performance on Chinese real-world tax tasks and scenarios.

  4. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    cs.CV 2024-06 conditional novelty 7.0

    Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

  5. Text-Guided Visual Representation Learning for Robust Multimodal E-Commerce Recommendation

    cs.IR 2026-05 unverdicted novelty 6.0

    TGQ-Former uses metadata-guided hybrid queries and dual-gated modulation to improve visual token selection in multimodal e-commerce retrieval, raising average Hit Rate@100 by 6.04% over baselines.

  6. LLM-Agnostic Semantic Representation Attack

    cs.CL 2026-05 unverdicted novelty 6.0

    SRA achieves 99.71% average attack success across 26 LLMs by optimizing for coherent malicious semantics via the SRHS algorithm, with claimed theoretical guarantees on convergence and transfer.

  7. HotComment: A Benchmark for Evaluating Popularity of Online Comments

    cs.AI 2026-04 unverdicted novelty 6.0

    HotComment is a new multimodal benchmark that quantifies online comment popularity via content quality assessment, interaction-based prediction, and agent-simulated user engagement, accompanied by the StyleCmt stylist...

  8. Dynamic Emotion and Personality Profiling for Multimodal Deception Detection

    cs.CL 2026-04 unverdicted novelty 6.0

    A new dataset DDEP and reliability-weighted fusion model Rel-DDEP jointly detect deception, emotion, and personality from multimodal data, reporting F1 gains of 2.53%, 2.66%, and 9.30% over baselines.

  9. Are We on the Right Way for Evaluating Large Vision-Language Models?

    cs.CV 2024-03 conditional novelty 6.0

    Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...

  10. MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

    cs.CV 2024-01 conditional novelty 6.0

    MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.

  11. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    cs.CV 2023-11 conditional novelty 6.0

    A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.

  12. ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring

    cs.CL 2026-05 unverdicted novelty 5.0

    ARGUS uses a Prosecutor-Defender-Umpire multi-agent setup plus RAG and chain-of-thought rewards to adapt ad policy enforcement to new regulations using minimal fresh labels.

  13. FedProxy: Federated Fine-Tuning of LLMs via Proxy SLMs and Heterogeneity-Aware Fusion

    cs.LG 2026-04 unverdicted novelty 5.0

    FedProxy replaces weak adapters with a proxy SLM for federated LLM fine-tuning, outperforming prior methods and approaching centralized performance via compression, heterogeneity-aware aggregation, and training-free fusion.

  14. ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction

    cs.CR 2025-06 unverdicted novelty 5.0

    ReGA uses safety-critical representations to guide abstraction in model-based analysis, enabling scalable detection of harmful LLM inputs with reported AUROC of 0.975 at prompt level.

  15. InternLM2 Technical Report

    cs.CL 2024-03 unverdicted novelty 5.0

    InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.

  16. TrustLLM: Trustworthiness in Large Language Models

    cs.CL 2024-01 unverdicted novelty 5.0

    TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt...

  17. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  18. DuIVRS-2: An LLM-based Interactive Voice Response System for Large-scale POI Attribute Acquisition

    cs.AI 2026-05 unverdicted novelty 4.0

    DuIVRS-2 deploys an LLM-driven IVR pipeline that processes 0.4 million calls per day at 83.9 percent task success rate using FSM-guided augmentation, selective CoT generation, and cooperative policy iteration.

  19. WisdomInterrogatory (LuWen): An Open-Source Legal Large Language Model Technical Report

    cs.CL 2026-04 unverdicted novelty 4.0

    LuWen is a new open-source Chinese legal LLM that outperforms baselines on judgment prediction, judicial exams, summarization, article QA, and decision reasoning through legal-domain adaptation of a general base model.

  20. Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities

    cs.LG 2024-08 accept novelty 4.0

    The paper introduces a new taxonomy for model merging methods and reviews their applications in LLMs, MLLMs, continual learning, multi-task learning, and other subfields while outlining open challenges.

  21. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  22. Yi: Open Foundation Models by 01.AI

    cs.CL 2024-03 unverdicted novelty 4.0

    Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.

  23. MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

    cs.CV 2024-02 unverdicted novelty 4.0

    MobileVLM V2 shows that 1.7B and 3B parameter vision-language models can reach or exceed the performance of 3B and 7B+ models on common VLM benchmarks via targeted design and data improvements.

  24. A Survey on Efficient Inference for Large Language Models

    cs.CL 2024-04 accept novelty 3.0

    The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

  25. A Survey of Large Language Models

    cs.CL 2023-03 accept novelty 3.0

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · cited by 25 Pith papers · 37 internal anchors

  3. [3]

    Yuvanesh Anand, Zach Nussbaum, Brandon Duderstadt, Benjamin Schmidt, and Andriy Mulyar. 2023. Gpt4all: Training an assistant-style chatbot with large scale data distillation from gpt-3.5-turbo. GitHub

  4. [4]

    Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. Palm 2 technical report. arXiv preprint arXiv:2305.10403

  5. [5]

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732

  6. [6]

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450

  7. [7]

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022a. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862

  8. [8]

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022b. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073

  9. [9]

    Baichuan. 2023a. https://github.com/baichuan-inc/Baichuan-13B A 13b large language model developed by baichuan intelligent technology

  10. [10]

    Baichuan. 2023b. https://github.com/baichuan-inc/Baichuan-7B A large-scale 7b pretraining language model developed by baichuan-inc

  11. [11]

    Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. 2023a. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397--2430. PMLR

  12. [12]

    Stella Rose Biderman, Hailey Schoelkopf, Quentin G. Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. 2023b. Pythia: A suite for analyzing large language models across training and scaling. ArXiv, abs/2304.01373

  13. [13]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901

  14. [14]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad B...

  15. [15]

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023)

  16. [16]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311

  17. [17]

    Claude. 2023. Conversation with Claude AI assistant

  18. [18]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

  19. [19]

    Marta R Costa-Jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672

  20. [20]

    Yiming Cui, Ziqing Yang, and Xin Yao. 2023. https://arxiv.org/abs/2304.08177 Efficient and effective text encoding for chinese llama and alpaca . arXiv preprint arXiv:2304.08177

  21. [21]

    Tri Dao. 2023. FlashAttention-2: Faster attention with better parallelism and work partitioning

  22. [22]

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems

  23. [23]

    Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In International conference on machine learning, pages 933--941. PMLR

  24. [24]

    William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research, 23(1):5232--5270

  25. [25]

    Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2021. https://doi.org/10.5281/zenodo.5371628 A framework for few-shot language model evaluation

  26. [26]

    Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2021. The flores-101 evaluation benchmark for low-resource and multilingual machine translation

  27. [27]

    Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato. 2019. Two new evaluation datasets for low-resource machine translation: Nepali-english and sinhala-english

  28. [28]

    Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. 2022. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509

  29. [29]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021a. Measuring massive multitask language understanding. In ICLR. OpenReview.net

  30. [30]

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021b. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874

  31. [31]

    Scaling Laws for Autoregressive Generative Modeling

    Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, et al. 2020. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701

  32. [32]

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556

  33. [33]

    Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. 2023. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322

  34. [34]

    Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. 2023. http://arxiv.org/abs/2307.04657 Beavertails: Towards improved safety alignment of llm via a human-preference dataset

  35. [35]

    Youhe Jiang, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, and Bin Cui. 2023a. Osdp: Optimal sharded data parallel for distributed deep learning. arXiv preprint arXiv:2209.13258

  36. [36]

    Zixuan Jiang, Jiaqi Gu, and David Z Pan. 2023b. Normsoftmax: Normalizing the input of softmax to accelerate and stabilize training. In 2023 IEEE International Conference on Omni-layer Intelligent Systems (COINS), pages 1--6. IEEE

  37. [37]

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2021. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421

  38. [38]

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361

  39. [39]

    Taku Kudo and John Richardson. 2018. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226

  40. [40]

    Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. 2023. http://arxiv.org/abs/2306.09212 Cmmlu: Measuring massive multitask language understanding in chinese

  41. [41]

    Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101

  42. [42]

    MosaicML. 2023. www.mosaicml.com/blog/mpt-7b Introducing mpt-7b: A new standard for open-source, commercially usable llms

  43. [43]

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. 2021. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Network...

  44. [44]

    Xiaonan Nie, Xupeng Miao, Zhi Yang, and Bin Cui. 2022. Tsplit: Fine-grained gpu memory management for efficient dnn training via tensor splitting. In 2022 IEEE 38th International Conference on Data Engineering (ICDE), pages 2615--2628. IEEE

  45. [45]

    OpenAI. 2022. Introducing chatgpt. Blog post openai.com/blog/chatgpt

  46. [46]

    OpenAI. 2023. Gpt-4 technical report. ArXiv, abs/2303.08774

  47. [47]

    OpenCompass. 2023. Opencompass: A universal evaluation platform for foundation models. https://github.com/InternLM/OpenCompass

  48. [48]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744

  49. [49]

    Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. 2022. https://proceedings.mlr.press/v174/pal22a.html Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering . In Proceedings of the Conference on Health, Inference, and Learning, volume 174 of Proceedings of Machine Learning Research, pages 248--260. PMLR

  50. [50]

    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116. http://arxiv.org/abs/2306.01116

  51. [51]

    Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365

  52. [52]

    Ofir Press, Noah A Smith, and Mike Lewis. 2021. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409

  53. [53]

    Markus N Rabe and Charles Staats. 2021. Self-attention does not need O(n^2) memory. arXiv preprint arXiv:2112.05682

  54. [54]

    Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training

  55. [55]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290

  56. [56]

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1--16. IEEE

  57. [57]

    Teven Le Scao, Angela Fan, Christopher Akiki, Elizabeth-Jane Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Rose Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, O...

  58. [58]

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347

  59. [59]

    Noam Shazeer. 2020. Glu variants improve transformer. arXiv preprint arXiv:2002.05202

  60. [60]

    Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. 2022. Language models are multilingual chain-of-thought reasoners. CoRR, abs/2210.03057

  61. [61]

    Yusuke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, and Setsuo Arikawa. 1999. Byte pair encoding: A text compression scheme that accelerates pattern matching

  62. [62]

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615

  63. [63]

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2021. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864

  64. [64]

    Tianxiang Sun, Xiaotian Zhang, Zhengfu He, Peng Li, Qinyuan Cheng, Hang Yan, Xiangyang Liu, Yunfan Shao, Qiong Tang, Xingjian Zhao, Ke Chen, Yining Zheng, Zhejian Zhou, Ruixiao Li, Jun Zhan, Yunhua Zhou, Linyang Li, Xiaogui Yang, Lingling Wu, Zhangyue Yin, Xuanjing Huang, and Xipeng Qiu. 2023. Moss: Training conversational language models from synthetic data

  65. [65]

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, and Jason Wei. 2022. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261

  66. [66]

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 3(6):7

  67. [67]

    Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. 2022. Galactica: A large language model for science. CoRR, abs/2211.09085

  68. [68]

    Kushal Tirumala, Aram Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. 2022. Memorization without overfitting: Analyzing the training dynamics of large language models. Advances in Neural Information Processing Systems, 35:38274--38290

  69. [70]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023b. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

  70. [71]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023c. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

  71. [72]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998--6008

  72. [73]

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560

  73. [74]

    Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. 2023. Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966

  74. [75]

    Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. 2020. On layer normalization in the transformer architecture. In International Conference on Machine Learning, pages 10524--10533. PMLR

  75. [76]

    Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023. Wizardlm: Empowering large language models to follow complex instructions

  76. [77]

    Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414

  77. [78]

    Biao Zhang and Rico Sennrich. 2019. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32

  78. [79]

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona T. Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. Opt: Open pre-trained transformer language models. ArXiv, abs/2205.01068

  79. [80]

    Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, and Xipeng Qiu. 2023. Evaluating the performance of large language models on gaokao benchmark

  80. [81]

    Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020. Jec-qa: A legal-domain question answering dataset. In Proceedings of AAAI
