pith. machine review for the scientific record.

arxiv: 2210.02414 · v2 · submitted 2022-10-05 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 3 theorem links · Lean Theorem

GLM-130B: An Open Bilingual Pre-trained Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 17:35 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords pre-trained language model · bilingual model · large language models · INT4 quantization · open source model · GLM-130B · benchmark comparison

The pith

GLM-130B, a 130B-parameter bilingual model, outperforms GPT-3 175B on English benchmarks and runs in INT4 on four consumer GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GLM-130B as a 130 billion parameter pre-trained model trained on English and Chinese text. It describes the design choices and training strategies developed to overcome loss spikes and divergence during pre-training at this scale. The resulting model surpasses GPT-3 175B on a range of English benchmarks and exceeds the larger ERNIE TITAN 3.0 on Chinese benchmarks. It further exploits a scaling property to reach INT4 quantization with almost no accuracy drop, allowing inference on modest hardware. The weights, code, and logs are released publicly.

Core claim

GLM-130B is a 130B-parameter bilingual pre-trained model that, after targeted training for stability, delivers higher scores than GPT-3 175B (davinci) across popular English benchmarks and higher scores than ERNIE TITAN 3.0 260B on Chinese benchmarks, while its scaling behavior permits direct INT4 quantization without post-training steps and with negligible loss.

What carries the argument

The training pipeline of design choices and stability strategies that prevent loss spikes and divergence at 130B scale, together with the scaling property that supports lossless INT4 quantization.
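
The quantization leg of the claim is weight-only INT4. As a minimal sketch of the standard technique such claims build on, the snippet below performs symmetric round-to-nearest (absmax) quantization with one scale per output row; the per-row granularity and helper names are illustrative assumptions, not the authors' exact recipe.

```python
import torch

def quantize_int4_rtn(w: torch.Tensor):
    """Symmetric round-to-nearest INT4 weight quantization, one scale per row."""
    # Per-row absmax scale so the largest weight in a row maps to +/-7.
    scale = (w.abs().amax(dim=1, keepdim=True) / 7.0).clamp(min=1e-8)
    # 4-bit signed levels, stored in int8 containers for simplicity.
    q = torch.clamp(torch.round(w / scale), -7, 7).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Reconstruct approximate weights for matmul at inference time."""
    return q.float() * scale

# Round-trip error on a toy weight matrix.
w = torch.randn(8, 64)
q, s = quantize_int4_rtn(w)
print((w - dequantize(q, s)).abs().max())
```

The paper's "scaling property" is, on its own account, the observation that GLM-130B's weight distributions tolerate this rounding directly, without post-training calibration; whether that holds is exactly what replication on the released weights would test.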

Load-bearing premise

The published benchmark scores reflect genuine capability rather than advantages from the bilingual data mixture or overlap with the closed training sets of the comparison models.

What would settle it

Performance on a fresh suite of held-out English and Chinese tasks that were never part of any public training corpus, where GLM-130B loses its reported edge over GPT-3 175B and ERNIE TITAN 3.0.

Original abstract

We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we face numerous unexpected technical and engineering challenges, particularly on loss spikes and divergence. In this paper, we introduce the training process of GLM-130B including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resultant GLM-130B model offers significant outperformance over GPT-3 175B (davinci) on a wide range of popular English benchmarks while the performance advantage is not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B -- the largest Chinese language model -- across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post training, with almost no performance loss, making it the first among 100B-scale models and more importantly, allowing its effective inference on 4$\times$RTX 3090 (24G) or 8$\times$RTX 2080 Ti (11G) GPUs, the most affordable GPUs required for using 100B-scale models. The GLM-130B model weights are publicly accessible and its code, training logs, related toolkit, and lessons learned are open-sourced at \url{https://github.com/THUDM/GLM-130B/}.
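
A back-of-envelope check on the hardware claim, assuming 4 bits per parameter and ignoring activations and the KV cache (both cost extra memory in practice):

```python
params = 130e9
weight_gib = params * 4 / 8 / 2**30  # 4 bits -> bytes -> GiB
print(f"INT4 weights: {weight_gib:.0f} GiB")       # ~61 GiB
print(f"4 x RTX 3090: {4 * 24} GB total VRAM")     # 96 GB
print(f"8 x RTX 2080 Ti: {8 * 11} GB total VRAM")  # 88 GB
```

Roughly 61 GiB of weights against 88-96 GB of aggregate VRAM is consistent with the abstract's hardware figures, with the remaining headroom going to activations and cache.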

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GLM-130B, a 130 billion parameter bilingual (English and Chinese) pre-trained language model. It describes the training process including design choices, efficiency and stability strategies to address loss spikes and divergence, reports significant outperformance over GPT-3 175B (davinci) on English benchmarks (unlike OPT-175B and BLOOM-176B), consistent superiority over ERNIE TITAN 3.0 260B on Chinese benchmarks, and INT4 quantization without post-training that enables inference on affordable consumer GPUs. Model weights, code, and training logs are open-sourced.

Significance. If the performance claims hold under fair and transparent evaluation protocols, the work is significant for releasing an open 100B-scale model that matches or exceeds closed counterparts like GPT-3, demonstrating practical quantization for accessibility, and documenting stability techniques for large-scale pre-training; these elements can accelerate reproducible research in NLP.

major comments (2)
  1. [§5 (Evaluation)] The outperformance claims over GPT-3 davinci rest on benchmark scores whose fairness cannot be verified: exact English data mixture ratios, n-gram decontamination logs, and per-task few-shot prompts are not supplied. Without these, the attribution of gains to the stability strategies rather than to data differences remains insecure (a minimal overlap check of the kind requested is sketched after this report).
  2. [§4 (Training)] The loss-spike handling and divergence-prevention techniques are presented as central to successful training, yet no ablation studies or quantitative comparisons isolate their contribution to final downstream scores, leaving the causal link to the reported benchmark advantages unestablished.
minor comments (2)
  1. [Abstract] The reference to a 'unique scaling property' enabling INT4 quantization should be cross-referenced to the precise equation or figure that defines it.
  2. [§4 (Training)] Ensure all training hyperparameters, data mixture statistics, and statistical significance tests for benchmark differences are consolidated in a single reproducibility table.
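
For concreteness on major comment 1, here is a minimal sketch of the kind of n-gram overlap check whose logs the referee asks for. The 13-gram window follows common decontamination practice for LLM evaluation (e.g., GPT-3's protocol); the function names and whitespace tokenization are illustrative simplifications.

```python
def ngrams(text: str, n: int = 13):
    """Lowercased word n-grams; the matching unit in common decontamination protocols."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_docs, eval_example: str, n: int = 13) -> bool:
    """Flag an eval example whose n-grams overlap any training document."""
    eval_grams = ngrams(eval_example, n)
    if not eval_grams:
        return False  # shorter than n tokens; needs a separate policy
    return any(eval_grams & ngrams(doc, n) for doc in train_docs)
```

Releasing per-benchmark outputs of a check like this is what would let readers separate genuine capability from training-set overlap.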

Simulated Author's Rebuttal

2 responses · 1 unresolved

We are grateful for the referee's insightful comments, which help improve the manuscript's rigor. We respond to each major comment below, making revisions where possible to enhance transparency.

Point-by-point responses
  1. Referee: [§5 (Evaluation)] The outperformance claims over GPT-3 davinci rest on benchmark scores whose fairness cannot be verified: exact English data mixture ratios, n-gram decontamination logs, and per-task few-shot prompts are not supplied. Without these, the attribution of gains to the stability strategies rather than to data differences remains insecure.

    Authors: We thank the referee for highlighting the need for greater transparency. In the revised manuscript, we will include the exact English data mixture ratios, n-gram decontamination procedures and logs, and the specific per-task few-shot prompts. These additions will permit independent verification of benchmark fairness and help clarify the relative contributions of data and training stability techniques. revision: yes

  2. Referee: [§4 (Training)] The loss-spike handling and divergence-prevention techniques are presented as central to successful training, yet no ablation studies or quantitative comparisons isolate their contribution to final downstream scores, leaving the causal link to the reported benchmark advantages unestablished.

    Authors: We agree that ablation studies would provide stronger causal evidence. However, performing them at 130B scale would require multiple full pre-training runs at prohibitive computational cost. We instead document the techniques in detail, release the full training logs, and show their immediate stabilizing effects via loss curves. This supplies practical guidance even without exhaustive ablations. revision: no
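
One concrete stabilizer the GLM-130B report describes is embedding-layer gradient shrink: the forward pass is unchanged while the gradient flowing into the word embeddings is scaled down. A minimal sketch (the alpha = 0.1 value follows the paper's description; the standalone function is illustrative framing):

```python
import torch

def shrink_embedding_grad(emb: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Identity in the forward pass; scales the backward gradient by alpha.

    The emb * alpha term carries gradients, emb.detach() * (1 - alpha) does not,
    so activations are unchanged while the embedding gradient is damped.
    """
    return emb * alpha + emb.detach() * (1 - alpha)

# Toy check: the gradient is scaled by alpha.
e = torch.ones(3, requires_grad=True)
shrink_embedding_grad(e, alpha=0.1).sum().backward()
print(e.grad)  # tensor([0.1000, 0.1000, 0.1000])
```

The released loss curves show the immediate effect of such interventions; what they cannot show, as the standing objection below notes, is the counterfactual downstream score without them.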

standing simulated objections not resolved
  • Performing ablation studies at 130B-parameter scale to isolate the downstream impact of loss-spike handling techniques

Circularity Check

0 steps flagged

Empirical pre-training and external benchmarking; no derivation reduces to inputs by construction

full rationale

The manuscript describes architecture choices, training stability techniques (e.g., loss-spike mitigation), and reports benchmark scores against GPT-3, OPT, BLOOM, and ERNIE. No equations or claims equate a 'prediction' to a fitted parameter, nor does any central result rest on a self-citation chain that itself lacks independent verification. All performance assertions are falsifiable via replication on the released weights and public benchmarks; the bilingual data mixture and decontamination steps are presented as engineering decisions rather than derived quantities.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard transformer architecture choices, conventional optimizer settings, and the assumption that the chosen English-Chinese data mixture produces comparable benchmark scores; no new physical or mathematical entities are introduced.

free parameters (2)
  • 130B parameter count
    Design choice to reach GPT-3 scale; not derived from data.
  • training data mixture ratio
    Chosen to balance English and Chinese performance; affects downstream scores.
axioms (1)
  • domain assumption: Standard transformer attention and feed-forward blocks suffice for 100B-scale language modeling
    Invoked throughout the training description.

pith-pipeline@v0.9.0 · 5662 in / 1323 out tokens · 46148 ms · 2026-05-14T17:35:52.559049+00:00 · methodology

discussion (0)


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

    cs.CL 2026-05 conditional novelty 7.0

    Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.

  2. A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Massive activations originate in a specific ME Layer across LLM families; reducing their token rigidity via a targeted method boosts performance and mitigates attention sinks.

  3. PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    PR-MaGIC refines prompts in in-context segmentation via test-time gradient flow from the mask decoder plus top-1 selection, yielding better masks across benchmarks without training.

  4. SAGE: A Service Agent Graph-guided Evaluation Benchmark

    cs.AI 2026-04 unverdicted novelty 7.0

    SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 m...

  5. QLoRA: Efficient Finetuning of Quantized LLMs

    cs.LG 2023-05 conditional novelty 7.0

    QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

  6. VideoChat: Chat-Centric Video Understanding

    cs.CV 2023-05 conditional novelty 7.0

    VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.

  7. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    cs.LG 2022-08 conditional novelty 7.0

    LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

  8. Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio

    cs.LG 2026-05 unverdicted novelty 6.0

    MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.

  9. Understanding the Mechanism of Altruism in Large Language Models

    econ.GN 2026-04 unverdicted novelty 6.0

    A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.

  10. EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation

    cs.DB 2026-04 unverdicted novelty 6.0

    EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.

  11. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    cs.CL 2025-05 conditional novelty 6.0

    Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.

  12. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

    cs.CL 2023-09 conditional novelty 6.0

    Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.

  13. Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

    cs.CL 2023-08 unverdicted novelty 6.0

    Pre-training loss predicts LLM math reasoning better than parameter count; rejection sampling fine-tuning with diverse paths raises LLaMA-7B accuracy on GSM8K from 35.9% with SFT to 49.3%.

  14. Gorilla: Large Language Model Connected with Massive APIs

    cs.CL 2023-05 conditional novelty 6.0

    Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.

  15. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    cs.CL 2023-05 unverdicted novelty 6.0

    Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.

  16. BloombergGPT: A Large Language Model for Finance

    cs.LG 2023-03 conditional novelty 6.0

    BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

  17. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    cs.CL 2022-11 unverdicted novelty 6.0

    BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.

  18. Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance

    cs.CL 2026-04 unverdicted novelty 5.0

    A new pre-training task that maps languages bidirectionally in embedding space improves machine translation by up to 11.9 BLEU, cross-lingual QA by 6.72 BERTScore points, and understanding accuracy by over 5% over str...

  19. StarCoder: may the source be with you!

    cs.CL 2023-05 accept novelty 5.0

    StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

  20. ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    cs.CL 2024-06 unverdicted novelty 3.0

    GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.

  21. A Survey on the Memory Mechanism of Large Language Model based Agents

    cs.AI 2024-04 accept novelty 3.0

    A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.

  22. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

  23. A Survey of Large Language Models

    cs.CL 2023-03 accept novelty 3.0

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
