Recognition: no theorem link
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
Pith reviewed 2026-05-11 08:01 UTC · model grok-4.3
The pith
GLM-4 language models rival or surpass GPT-4 on benchmarks for general ability, reasoning, coding, and Chinese alignment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The GLM-4 models, trained on ten trillion tokens and aligned through supervised fine-tuning and human feedback, closely rival or outperform GPT-4 on general metrics such as MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval; approach GPT-4-Turbo in instruction following on IFEval; match GPT-4 Turbo and Claude 3 on long-context tasks; and outperform GPT-4 on Chinese alignment as measured by AlignBench. The GLM-4 All Tools model can autonomously select and use tools including a web browser, Python interpreter, and text-to-image models to complete complex tasks, matching or exceeding GPT-4 All Tools in practical applications.
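The "All Tools" behavior described above follows a now-common dispatch pattern: the model emits a structured tool call and a router executes it. A minimal sketch, assuming an illustrative JSON call format and stub tools — the names and format here are assumptions, not the paper's actual API:

```python
import json

# Stand-in tools: a toy "Python interpreter" and a stubbed web search.
# Real systems would sandbox code execution and call a search backend.
TOOLS = {
    "python": lambda code: str(eval(code)),               # toy interpreter
    "browser": lambda query: f"<results for {query!r}>",  # search stub
}

def dispatch(model_output: str) -> str:
    """Parse a tool call like {"tool": "python", "input": "2 + 3"} and run it."""
    call = json.loads(model_output)
    return TOOLS[call["tool"]](call["input"])
```

For example, a model deciding a math question needs the interpreter would emit `{"tool": "python", "input": "2 + 3"}`, and the router returns the tool's result for the next generation step.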
What carries the argument
The multi-stage post-training process of supervised fine-tuning followed by learning from human feedback, applied after pre-training on massive multilingual token corpora.
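The SFT stage of that process is, at its core, next-token cross-entropy computed only on response tokens, with prompt tokens masked out of the loss. A minimal sketch of that masking, with illustrative inputs (the paper does not specify this exact formulation):

```python
import math

def sft_loss(token_logprobs, loss_mask):
    """Average negative log-likelihood over unmasked (response) positions.

    token_logprobs: log p(token_t | tokens_<t) at each position.
    loss_mask: 1 for response tokens, 0 for prompt tokens (excluded from loss).
    """
    picked = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return sum(picked) / len(picked)
```

The human-feedback stage (RLHF) then optimizes a reward signal on top of this SFT model; the ChatGLM-RLHF reference in the graph below covers those practices.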
Load-bearing premise
The benchmark scores represent authentic model capabilities and are not inflated by test contamination, specific prompt engineering, or incomplete reporting of evaluation details.
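One standard check this premise turns on is n-gram decontamination: flag a training document if it shares a long n-gram with any benchmark item. A minimal sketch; the 8-gram threshold is a convention borrowed from other LLM reports, assumed here rather than taken from this paper:

```python
def ngrams(text: str, n: int):
    """Set of whitespace-token n-grams, lowercased for matching."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc: str, benchmark_items, n: int = 8) -> bool:
    """True if the training document shares any n-gram with a benchmark item."""
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)
```

Absent reported decontamination details, a check like this run against the released training-adjacent corpora is the kind of evidence that would support or undercut the premise.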
What would settle it
Running the open-sourced GLM-4-9B model through the exact same benchmark suites using publicly available evaluation code and comparing the resulting scores to the reported ones.
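The scoring half of that replication can be sketched as a GSM8K-style exact-match harness; generating the completions themselves would use the open GLM-4-9B weights (e.g., loaded via `transformers` from the THUDM hub linked in the abstract), which is assumed here rather than shown:

```python
import re

def extract_final_number(completion: str):
    """GSM8K-style answer extraction: take the last number in the completion."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return nums[-1] if nums else None

def exact_match_accuracy(completions, references):
    """Fraction of completions whose final number matches the reference answer."""
    hits = sum(
        extract_final_number(c) == r for c, r in zip(completions, references)
    )
    return hits / len(references)
```

Comparing this score, computed with public eval code over the official test splits, against the reported numbers is exactly the settling experiment described above.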
read the original abstract
We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. They represent our most capable models that are trained with all the insights and lessons gained from the preceding three generations of ChatGLM. To date, the GLM-4 models are pre-trained on ten trillions of tokens mostly in Chinese and English, along with a small set of corpus from 24 languages, and aligned primarily for Chinese and English usage. The high-quality alignment is achieved via a multi-stage post-training process, which involves supervised fine-tuning and learning from human feedback. Evaluations show that GLM-4 1) closely rivals or outperforms GPT-4 in terms of general metrics such as MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval, 2) gets close to GPT-4-Turbo in instruction following as measured by IFEval, 3) matches GPT-4 Turbo (128K) and Claude 3 for long context tasks, and 4) outperforms GPT-4 in Chinese alignments as measured by AlignBench. The GLM-4 All Tools model is further aligned to understand user intent and autonomously decide when and which tool(s) to use -- including web browser, Python interpreter, text-to-image model, and user-defined functions -- to effectively complete complex tasks. In practical applications, it matches and even surpasses GPT-4 All Tools in tasks like accessing online information via web browsing and solving math problems using Python interpreter. Over the course, we have open-sourced a series of models, including ChatGLM-6B (three generations), GLM-4-9B (128K, 1M), GLM-4V-9B, WebGLM, and CodeGeeX, attracting over 10 million downloads on Hugging Face in the year 2023 alone. The open models can be accessed through https://github.com/THUDM and https://huggingface.co/THUDM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the ChatGLM family of LLMs, emphasizing the GLM-4 series (GLM-4, GLM-4-Air, GLM-4-9B) pre-trained on 10 trillion tokens (mostly Chinese and English) and aligned via multi-stage SFT and RLHF. It claims GLM-4 rivals/outperforms GPT-4 on general benchmarks (MMLU, GSM8K, MATH, BBH, GPQA, HumanEval), approaches GPT-4-Turbo on IFEval, matches on long-context tasks, and exceeds GPT-4 on Chinese alignment (AlignBench). The GLM-4 All Tools variant is described as capable of autonomous tool use (browser, Python, etc.), matching or surpassing GPT-4 All Tools in practical tasks. Prior models have been open-sourced with significant community adoption.
Significance. If substantiated, the results would be significant for demonstrating a competitive open LLM family, especially in multilingual (Chinese-English) capabilities and tool-augmented reasoning. The open-sourcing of earlier models (ChatGLM-6B generations, GLM-4-9B, etc.) with over 10 million Hugging Face downloads provides a valuable resource for the community and allows partial verification of the development trajectory.
major comments (2)
- [Abstract] Abstract: The central performance claims that GLM-4 closely rivals or outperforms GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval, gets close to GPT-4-Turbo on IFEval, matches on long context, and outperforms on AlignBench are presented without any details on training data composition, decontamination, evaluation protocols, error bars, or released model weights. This blocks independent verification and is load-bearing for the paper's main empirical assertions.
- [Abstract] Abstract (GLM-4 All Tools): The claims regarding the GLM-4 All Tools model's performance in autonomously using tools like web browser and Python interpreter to match or surpass GPT-4 All Tools lack specific task definitions, quantitative metrics, or experimental setups, making these practical application results unverifiable.
minor comments (2)
- [Abstract] The phrase 'ten trillions of tokens' should be corrected to 'ten trillion tokens' for proper English usage.
- [Abstract] Standard benchmarks such as MMLU, GSM8K, etc., are mentioned without references; adding citations would improve clarity for readers unfamiliar with them.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on the abstract. We have revised the manuscript to improve verifiability by adding explicit references to detailed sections on data, evaluations, and tool-use experiments, while noting limitations on proprietary information.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central performance claims that GLM-4 closely rivals or outperforms GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval, gets close to GPT-4-Turbo on IFEval, matches on long context, and outperforms on AlignBench are presented without any details on training data composition, decontamination, evaluation protocols, error bars, or released model weights. This blocks independent verification and is load-bearing for the paper's main empirical assertions.
Authors: We appreciate the referee highlighting the need for greater transparency. The abstract is a concise summary, but we have revised it to briefly note the 10-trillion-token pre-training scale and to direct readers to Section 3 for data composition and decontamination details, Section 5 for evaluation protocols (including error bars where reported), and the introduction for model release information. GLM-4-9B weights are publicly available on Hugging Face, supporting partial verification of the trajectory as described. Full proprietary training data composition for the closed GLM-4 model cannot be disclosed, consistent with industry practice for frontier models; we have clarified this distinction to aid readers. revision: partial
-
Referee: [Abstract] Abstract (GLM-4 All Tools): The claims regarding the GLM-4 All Tools model's performance in autonomously using tools like web browser and Python interpreter to match or surpass GPT-4 All Tools lack specific task definitions, quantitative metrics, or experimental setups, making these practical application results unverifiable.
Authors: We agree the abstract description was high-level and have revised it to specify example tasks (e.g., web-based information retrieval queries and Python-based math problem solving), along with success-rate metrics and direct comparisons to GPT-4 All Tools. We now explicitly reference the expanded experimental details, task definitions, and setups in the new Section 6 on tool-use alignment and evaluation, where autonomous decision-making and practical outcomes are quantified. revision: yes
Circularity Check
No circularity: empirical benchmark reporting with no derivation chain
full rationale
The paper is an empirical report on training and evaluating the GLM-4 model family. It states pre-training corpus size, alignment process, and benchmark scores (MMLU, GSM8K, etc.) but contains no equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations that reduce claims to inputs by construction. All performance assertions rest on external benchmark comparisons rather than any internal mathematical reduction or ansatz smuggling. This is the standard case of a non-circular empirical model release paper.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Standard transformer architecture and next-token prediction objective suffice for scaling to trillions of tokens
- domain assumption Multi-stage supervised fine-tuning plus human feedback produces reliable instruction following and tool-use behavior
Forward citations
Cited by 50 Pith papers
-
CHASM: Unveiling Covert Advertisements on Chinese Social Media
CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.
-
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems
DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.
-
PRISM: Planning and Reasoning with Intent in Simulated Embodied Environments
PRISM is a tiered benchmark with 300 human-verified tasks across five photorealistic apartments that diagnoses embodied agent failures in basic ability, reasoning ability, and long-horizon ability using an agent-agnostic API.
-
K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs
K12-KGraph is a textbook-derived knowledge graph that powers a new benchmark revealing LLMs' poor curriculum cognition and a small training corpus that outperforms general instruction data on educational tasks.
-
Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs
LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
-
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...
-
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
TextPro-SLM minimizes the speech-text modality gap from the input side via a prosody-aware unified encoder, delivering the lowest gap and strong performance at 3B/7B scales with only ~1000 hours of audio.
-
Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving
Tutti is a GPU-direct SSD-backed KV cache that removes CPU bottlenecks via object abstraction, GPU io_uring, and slack scheduling, delivering near-DRAM performance at 2x higher request rate and 27% lower cost than pri...
-
OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice
OralMLLM-Bench is a new benchmark with 27 tasks in four cognitive categories that evaluates six MLLMs on dental radiographs and shows clear performance gaps versus clinicians.
-
OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice
OralMLLM-Bench reveals performance gaps between multimodal large language models and clinicians on cognitive tasks for dental radiographic analysis across periapical, panoramic, and cephalometric images.
-
FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios
FinSafetyBench shows that LLMs remain vulnerable to adversarial prompts that bypass financial compliance safeguards, with notably higher failure rates in Chinese-language scenarios.
-
From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework
AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming pri...
-
ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction
ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning...
-
EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs
EmoTrans is a new video benchmark with four progressive tasks that measures how well current multimodal LLMs handle dynamic emotion transitions rather than static recognition.
-
Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts
Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.
-
C-Mining: Unsupervised Discovery of Seeds for Cultural Data Synthesis via Geometric Misalignment
C-Mining automatically mines high-fidelity Culture Points from raw multilingual text by treating cross-lingual geometric isolation in embeddings as a quantifiable signal for cultural specificity, then uses them to syn...
-
TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice
TaxPraBen is a new benchmark with 14 datasets and a structured evaluation method for measuring LLM performance on Chinese real-world tax tasks and scenarios.
-
How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace
Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.
-
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.
-
SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems
SkillTrojan demonstrates that backdoors can be placed in composable skills of agent systems to achieve up to 97% attack success rate with only minor loss in clean-task accuracy.
-
Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation
Visual attention in MLLMs shows inertia that hinders cognitive inference on object relations, addressed by a training-free Inertia-aware Visual Excitation method that selects dynamically emerging tokens and applies an...
-
When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs
Layer-wise Laplacian energy of visual attention reveals hallucination emergence in MLLMs and enables LaSCD, a closed-form logit remapping strategy that mitigates hallucinations while preserving general performance.
-
UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification
UniPrefill accelerates LLM prefill via block-wise dynamic sparsification, achieving up to 2.1x TTFT speedup while supporting hybrid architectures and native vLLM continuous batching.
-
On the Role of Language Representations in Auto-Bidding: Findings and Implications
SemBid injects LLM-encoded Task, History, and Strategy semantics as tokens into offline bidding trajectories and uses self-attention to outperform numerical-only baselines in performance, constraint satisfaction, and ...
-
CAR: Query-Guided Confidence-Aware Reranking for Retrieval-Augmented Generation
CAR reranks documents in RAG by promoting those that increase generator confidence (via answer consistency sampling) and demoting those that decrease it, yielding NDCG@5 gains on BEIR datasets that correlate with F1 i...
-
Theory-Grounded Evaluation Exposes the Authorship Gap in LLM Personalization
Theory-grounded authorship metrics show four LLM personalization methods score below calibrated baselines (0.484-0.508 vs. 0.626 floor), exposing a gap hidden by uncalibrated evaluations.
-
Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding
Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.
-
CAP: Controllable Alignment Prompting for Unlearning in LLMs
CAP optimizes prompts via reinforcement learning to selectively unlearn target knowledge in LLMs while preserving general capabilities, without any parameter updates and with reversible revocation.
-
CAP: Controllable Alignment Prompting for Unlearning in LLMs
CAP enables reversible unlearning of targeted knowledge in LLMs through optimized prompts generated via reinforcement learning, without any parameter updates.
-
LLM Safety From Within: Detecting Harmful Content with Internal Representations
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
-
Multi-LLM Token Filtering and Routing for Sequential Recommendation
MLTFR combines user-guided token filtering with a multi-LLM mixture-of-experts and Fisher-weighted consensus expert to deliver stable gains in corpus-free sequential recommendation.
-
MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs
MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.
-
The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems
Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
-
Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis
Transferring a 2D MLLM to 3D CT inputs via parameter reuse, a Text-Guided Hierarchical MoE framework, and two-stage training yields better performance than prior 3D medical MLLMs on medical report generation and visua...
-
Aligned Agents, Biased Swarm: Measuring Bias Amplification in Multi-Agent Systems
Multi-agent systems amplify minor stochastic biases into systemic polarization via echo-chamber effects in structured workflows, even with neutral agents.
-
The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.
-
In-Place Test-Time Training
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
-
Cooking Up Risks: Benchmarking and Reducing Food Safety Risks in Large Language Models
A new benchmark exposes food-safety gaps in current LLMs and guardrails, and a fine-tuned 4B model is offered as a domain-specific fix.
-
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
-
When Emotion Becomes Trigger: Emotion-style dynamic Backdoor Attack Parasitising Large Language Models
Paraesthesia is an emotion-style dynamic backdoor attack achieving ~99% success rate on instruction and classification tasks across four LLMs while preserving clean performance.
-
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.
-
From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents
AdaPlan-H enables LLM agents to generate self-adaptive hierarchical plans that adjust detail level to task difficulty, improving success rates in multi-step tasks.
-
ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures
ReCAPA uses multi-level predictive correction and semantic alignment modules to reduce cascading failures in VLA systems, with new metrics for tracking error propagation and recovery on embodied benchmarks.
-
ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures
ReCAPA adds predictive correction and multi-level semantic alignment to VLA models, plus two new metrics for tracking error spread and recovery, yielding competitive benchmark results over LLM baselines.
-
SparseBalance: Load-Balanced Long Context Training with Dynamic Sparse Attention
SparseBalance dynamically adjusts sparsity and batches workloads to load-balance sparse attention training, delivering up to 1.33x speedup and 0.46% better long-context performance on LongBench.
-
Disposition Distillation at Small Scale: A Three-Arc Negative Result
Multiple standard techniques for instilling dispositions in small LMs consistently failed across five models, with initial apparent gains revealed as artifacts and cross-validation collapsing to chance.
-
MAFIG: Multi-agent Driven Formal Instruction Generation Framework
MAFIG uses a Perception Agent and Emergency Decision Agent plus span-focused local distillation to let lightweight models rapidly generate formal instructions that fix local scheduling failures, achieving over 94% suc...
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
-
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models
LlamaFactory provides a unified no-code framework for efficient fine-tuning of 100+ LLMs via an integrated web UI and has been released on GitHub.
-
XekRung Technical Report
XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.
Reference graph
Works this paper leans on
-
[1]
Y. Bai, X. Lv, J. Zhang, Y. He, J. Qi, L. Hou, J. Tang, Y. Dong, and J. Li. Longalign: A recipe for long context alignment of large language models, 2024
work page 2024
-
[2]
Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y. Dong, J. Tang, and J. Li. Longbench: A bilingual, multitask benchmark for long context understanding, 2023
work page 2023
-
[3]
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amo...
work page 2020
-
[4]
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Her...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[5]
S. Chen, S. Wong, L. Chen, and Y. Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023
work page internal anchor Pith review arXiv 2023
-
[6]
PaLM: Scaling Language Modeling with Pathways
A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[7]
Training Verifiers to Solve Math Word Problems
K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[8]
T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022
work page 2022
-
[9]
M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang, and J. Tang. Cogview: Mastering text-to-image generation via transformers, 2021
work page 2021
-
[10]
M. Ding, W. Zheng, W. Hong, and J. Tang. Cogview2: Faster and better text-to-image generation via hierarchical transformers. Advances in Neural Information Processing Systems, 35:16890–16902, 2022
work page 2022
-
[11]
Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335, 2022
work page 2022
-
[12]
Z. Du, A. Zeng, Y. Dong, and J. Tang. Understanding emergent abilities of language models from the loss perspective, 2024
work page 2024
-
[13]
T. GLM. Chatglm-6b: An open bilingual dialogue language model. https://github.com/THUDM/ChatGLM-6B, 2023
work page 2023
-
[14]
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021
work page 2021
-
[15]
Gaussian Error Linear Units (GELUs)
D. Hendrycks and K. Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[16]
W. Hong, W. Wang, Q. Lv, J. Xu, W. Yu, J. Ji, Y. Wang, Z. Wang, Y. Zhang, J. Li, B. Xu, Y. Dong, M. Ding, and J. Tang. Cogagent: A visual language model for gui agents, 2023
work page 2023
-
[17]
Z. Hou, Y. Niu, Z. Du, X. Zhang, X. Liu, A. Zeng, Q. Zheng, M. Huang, H. Wang, J. Tang, and Y. Dong. Chatglm-rlhf: Practices of aligning large language models with human feedback, 2024
work page 2024
- [18]
-
[19]
Y. Li, S. Bubeck, R. Eldan, A. D. Giorno, S. Gunasekar, and Y. T. Lee. Textbooks are all you need ii: phi-1.5 technical report, 2023
work page 2023
-
[20]
P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. Cosgrove, C. D. Manning, C. Ré, D. Acosta-Navas, D. A. Hudson, E. Zelikman, E. Durmus, F. Ladhak, F. Rong, H. Ren, H. Yao, J. Wang, K. Santhanam, L. Orr, L. Zheng, M. Yuksekgonul, M. Suzgun, N. Kim, N. Guha...
work page 2023
- [21]
-
[22]
X. Liu, H. Lai, H. Yu, Y. Xu, A. Zeng, Z. Du, P. Zhang, Y. Dong, and J. Tang. Webglm: Towards an efficient web-enhanced question answering system with human preferences. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4549–4560, 2023
work page 2023
-
[23]
X. Liu, X. Lei, S. Wang, Y. Huang, Z. Feng, B. Wen, J. Cheng, P. Ke, Y. Xu, W. L. Tam, X. Zhang, L. Sun, H. Wang, J. Zhang, M. Huang, Y. Dong, and J. Tang. Alignbench: Benchmarking chinese alignment of large language models, 2023
work page 2023
-
[24]
X. Liu, X. Song, Y. Dong, and J. Tang. Extensive self-contrast enables feedback-free language model alignment, 2024
work page 2024
-
[25]
X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang, Y. Su, H. Sun, M. Huang, Y. Dong, and J. Tang. Agentbench: Evaluating llms as agents, 2023
work page 2023
-
[26]
Introducing meta llama 3: The most capable openly available llm to date
Meta. Introducing meta llama 3: The most capable openly available llm to date. https://ai.meta.com/blog/meta-llama-3/, 2024
work page 2024
- [27]
-
[28]
OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023
work page 2023
- [29]
-
[30]
J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023
work page 2023
- [31]
-
[32]
D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. CoRR, abs/2311.12022, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[33]
T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022
work page internal anchor Pith review arXiv 2022
-
[34]
R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, 2016. Association for Computational Linguistics
work page 2016
-
[35]
N. Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1911
-
[36]
N. Shazeer. Glu variants improve transformer, 2020
work page 2020
-
[37]
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, A. Kluska, A. Lewkowycz, A. Agarwal, A. Power, A. Ray, A. Warstadt, A. W. Kocurek, A. Safaya, A. Tazarv, A. Xiang, A. Parrish, A. Nie, A. Hussain, A. Askell, A. Dsouza, A. Rahane, A. S. Iyer, A. Andreassen, A. Santilli, A. Stuhlmülle...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[38]
J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[39]
M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, and J. Wei. Challenging big-bench tasks and whether chain-of-thought can solve them. In A. Rogers, J. L. Boyd-Graber, and N. Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 202...
work page 2023
-
[40]
G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, D. Silver, S. Petrov, M. Johnson, I. Antonoglou, J. Schrittwieser, A. Glaese, J. Chen, E. Pitler, T. Lillicrap, A. Lazaridou, O. Firat, J. Molloy, M. Isard, P. R. Barham, T. Hennigan, B. Lee, F. Viola, M. Reynolds, Y. Xu, R. Doherty, E...
work page 2023
-
[41]
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. Llama: Open and efficient foundation language models, 2023
work page 2023
-
[42]
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, 17 J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V . Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V . Kerkez, M. Khabsa, I. Kloumann, A. Ko...
work page 2023
-
[43]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need, 2023
work page 2023
-
[44]
H. Wang, S. Ma, L. Dong, S. Huang, D. Zhang, and F. Wei. Deepnet: Scaling transformers to 1,000 layers, 2022
work page 2022
-
[45]
W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y . Wang, J. Ji, Z. Yang, L. Zhao, X. Song, J. Xu, B. Xu, J. Li, Y . Dong, M. Ding, and J. Tang. Cogvlm: Visual expert for pretrained language models, 2023
work page 2023
-
[46]
J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V . Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing System...
work page 2022
-
[47]
Effective long-context scaling of foundation models
W. Xiong, J. Liu, I. Molybog, H. Zhang, P. Bhargava, R. Hou, L. Martin, R. Rungta, K. A. Sankararaman, B. Oguz, et al. Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039, 2023
-
[48]
Y . Xu, X. Liu, X. Liu, Z. Hou, Y . Li, X. Zhang, Z. Wang, A. Zeng, Z. Du, W. Zhao, J. Tang, and Y . Dong. Chatglm-math: Improving math problem-solving in large language models with a self-critique pipeline, 2024
work page 2024
-
[49]
F. Yan, H. Mao, C. C.-J. Ji, T. Zhang, S. G. Patil, I. Stoica, and J. E. Gonzalez. Berkeley function calling leaderboard. 2024
work page 2024
-
[50]
Rethinking benchmark and contamination for language models with rephrased samples,
S. Yang, W.-L. Chiang, L. Zheng, J. E. Gonzalez, and I. Stoica. Rethinking benchmark and contamination for language models with rephrased samples. arXiv preprint arXiv:2311.04850, 2023
-
[51]
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[52]
A. Zeng, M. Liu, R. Lu, B. Wang, X. Liu, Y . Dong, and J. Tang. Agenttuning: Enabling generalized agent abilities for llms, 2023
work page 2023
-
[53]
A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y . Xu, W. Zheng, X. Xia, et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022
work page internal anchor Pith review arXiv 2022
-
[54]
OPT: Open Pre-trained Transformer Language Models
S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V . Lin, et al. Opt: Open pre-trained transformer language models.arXiv preprint arXiv:2205.01068, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
- [55]
-
[56]
arXiv:2309.07045 (2023), https://arxiv.org/abs/2309.07045
Z. Zhang, L. Lei, L. Wu, R. Sun, Y . Huang, C. Long, X. Liu, X. Lei, J. Tang, and M. Huang. Safetybench: Evaluating the safety of large language models with multiple choice questions. arXiv preprint arXiv:2309.07045, 2023
-
[57]
W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y . Hou, Y . Min, B. Zhang, J. Zhang, Z. Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023. 18
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [58]
- [59]
-
[60]
C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y . Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh, M. Lewis, L. Zettlemoyer, and O. Levy. Lima: Less is more for alignment, 2023
work page 2023
- [61]
-
[62]
J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y . Luan, D. Zhou, and L. Hou. Instruction- following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023. 19
work page internal anchor Pith review Pith/arXiv arXiv 2023