pith. machine review for the scientific record.

arxiv: 2406.12793 · v2 · submitted 2024-06-18 · 💻 cs.CL

Recognition: no theorem link

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

Team GLM: Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Dan Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jingyu Sun, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiaohan Zhang, Xiao Liu, Xiaotao Gu, Xiao Xia, Xinghan Liu, Xin Lv, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zhang, Yifan An, Yifan Xu, Yilin Niu, Yuantao Yang, Yueyan Li, Yushi Bai, Yuxiao Dong, Zehan Qi, Zhaoyu Wang, Zhengxiao Du, Zhen Yang, Zhenyu Hou, Zihan Wang

Pith reviewed 2026-05-11 08:01 UTC · model grok-4.3

classification 💻 cs.CL
keywords: large language models · GLM-4 · benchmark evaluation · model alignment · tool use · Chinese language processing

The pith

GLM-4 language models rival or surpass GPT-4 on benchmarks for general ability, reasoning, coding, and Chinese alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes the development of the ChatGLM family of large language models, with a focus on the GLM-4 series: GLM-4, GLM-4-Air, and GLM-4-9B. These models are pre-trained on ten trillion tokens, primarily in Chinese and English, and undergo a multi-stage alignment process. Evaluations indicate that GLM-4 performs at or above GPT-4 levels on several key metrics, while the All Tools variant adds autonomous tool use. This suggests that competitive alternatives to closed-source frontier models are available, especially for Chinese-language applications.

Core claim

The GLM-4 models, trained on ten trillion tokens and aligned through supervised fine-tuning and human feedback, closely rival or outperform GPT-4 on general metrics such as MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval; approach GPT-4-Turbo in instruction following on IFEval; match GPT-4 Turbo and Claude 3 on long-context tasks; and outperform GPT-4 on Chinese alignment as measured by AlignBench. The GLM-4 All Tools model can autonomously select and use tools, including a web browser, a Python interpreter, and text-to-image models, to complete complex tasks, matching or exceeding GPT-4 All Tools in practical applications.
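The "autonomously select and use tools" behavior claimed above can be pictured as a dispatch loop: the model either emits a final answer or names a tool, and a controller runs the tool and feeds the observation back. This is a minimal illustrative sketch with hypothetical names, not the paper's implementation:

```python
# Minimal sketch of an "All Tools"-style dispatch loop (hypothetical,
# not the paper's implementation). The model step returns either a
# final answer or a tool call; the controller routes the call and
# appends the tool's result to the conversation history.

def run_python(code: str) -> str:
    # Stand-in for a sandboxed Python interpreter tool.
    try:
        return str(eval(code, {"__builtins__": {}}))
    except Exception as exc:
        return f"error: {exc}"

TOOLS = {"python": run_python}

def all_tools_loop(model_step, user_query: str, max_turns: int = 5) -> str:
    """model_step(history) -> ("final", text) or ("tool", name, arg)."""
    history = [("user", user_query)]
    for _ in range(max_turns):
        action = model_step(history)
        if action[0] == "final":
            return action[1]
        _, name, arg = action
        result = TOOLS[name](arg)          # invoke the selected tool
        history.append(("tool", result))   # feed the observation back
    return "max turns exceeded"

# Toy policy: route arithmetic to the interpreter, then report its output.
def toy_model(history):
    if history[-1][0] == "user":
        return ("tool", "python", "17 * 23")
    return ("final", f"The answer is {history[-1][1]}.")

print(all_tools_loop(toy_model, "What is 17 * 23?"))  # The answer is 391.
```

The real system adds intent understanding and multi-tool planning; the loop only shows the control flow the claim presupposes.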

What carries the argument

The multi-stage post-training process of supervised fine-tuning followed by learning from human feedback, applied after pre-training on massive multilingual token corpora.

Load-bearing premise

The benchmark scores represent authentic model capabilities and are not inflated by test contamination, specific prompt engineering, or incomplete reporting of evaluation details.
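One standard way to probe the contamination half of this premise is n-gram overlap filtering between evaluation items and the training corpus. The paper's actual decontamination procedure is not specified here; this is a hedged sketch of the generic technique:

```python
# Hedged sketch of an n-gram decontamination check (a common practice,
# not necessarily the paper's procedure): flag an eval item if any of
# its word n-grams also appears in a training document.

def ngrams(text: str, n: int = 8) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(eval_item: str, corpus_docs: list, n: int = 8) -> bool:
    """True if the eval item shares any word n-gram with the corpus."""
    item_grams = ngrams(eval_item, n)
    return any(item_grams & ngrams(doc, n) for doc in corpus_docs)
```

In practice the corpus side is hashed or indexed rather than scanned, and the choice of n trades false positives against misses; the check only bounds verbatim leakage, not paraphrased contamination.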

What would settle it

Running the open-sourced GLM-4-9B model through the exact same benchmark suites using publicly available evaluation code and comparing the resulting scores to the reported ones.
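The comparison step of such a reproduction reduces to checking each reproduced score against the reported one within a tolerance. The numbers below are placeholders, not measurements from the paper:

```python
# Sketch of the settling experiment's final step: compare reproduced
# benchmark scores for GLM-4-9B against the reported ones. All values
# here are placeholders, not real measurements.

def scores_match(reported: dict, reproduced: dict, tol: float = 1.5) -> dict:
    """Flag each benchmark as consistent if |reported - reproduced| <= tol points."""
    return {bench: abs(reported[bench] - reproduced[bench]) <= tol
            for bench in reported if bench in reproduced}

reported   = {"MMLU": 80.0, "GSM8K": 85.0}   # placeholder values
reproduced = {"MMLU": 79.2, "GSM8K": 81.0}   # placeholder values
print(scores_match(reported, reproduced))    # {'MMLU': True, 'GSM8K': False}
```

A sensible tolerance would come from run-to-run variance of the public evaluation harness (sampling temperature, prompt template), which is why protocol details matter as much as the scores themselves.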

Original abstract

We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. They represent our most capable models that are trained with all the insights and lessons gained from the preceding three generations of ChatGLM. To date, the GLM-4 models are pre-trained on ten trillions of tokens mostly in Chinese and English, along with a small set of corpus from 24 languages, and aligned primarily for Chinese and English usage. The high-quality alignment is achieved via a multi-stage post-training process, which involves supervised fine-tuning and learning from human feedback. Evaluations show that GLM-4 1) closely rivals or outperforms GPT-4 in terms of general metrics such as MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval, 2) gets close to GPT-4-Turbo in instruction following as measured by IFEval, 3) matches GPT-4 Turbo (128K) and Claude 3 for long context tasks, and 4) outperforms GPT-4 in Chinese alignments as measured by AlignBench. The GLM-4 All Tools model is further aligned to understand user intent and autonomously decide when and which tool(s) to use -- including web browser, Python interpreter, text-to-image model, and user-defined functions -- to effectively complete complex tasks. In practical applications, it matches and even surpasses GPT-4 All Tools in tasks like accessing online information via web browsing and solving math problems using Python interpreter. Over the course, we have open-sourced a series of models, including ChatGLM-6B (three generations), GLM-4-9B (128K, 1M), GLM-4V-9B, WebGLM, and CodeGeeX, attracting over 10 million downloads on Hugging face in the year 2023 alone. The open models can be accessed through https://github.com/THUDM and https://huggingface.co/THUDM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents the ChatGLM family of LLMs, emphasizing the GLM-4 series (GLM-4, GLM-4-Air, GLM-4-9B) pre-trained on 10 trillion tokens (mostly Chinese and English) and aligned via multi-stage SFT and RLHF. It claims GLM-4 rivals/outperforms GPT-4 on general benchmarks (MMLU, GSM8K, MATH, BBH, GPQA, HumanEval), approaches GPT-4-Turbo on IFEval, matches on long-context tasks, and exceeds GPT-4 on Chinese alignment (AlignBench). The GLM-4 All Tools variant is described as capable of autonomous tool use (browser, Python, etc.), matching or surpassing GPT-4 All Tools in practical tasks. Prior models have been open-sourced with significant community adoption.

Significance. If substantiated, the results would be significant for demonstrating a competitive open LLM family, especially in multilingual (Chinese-English) capabilities and tool-augmented reasoning. The open-sourcing of earlier models (ChatGLM-6B generations, GLM-4-9B, etc.) with over 10 million Hugging Face downloads provides a valuable resource for the community and allows partial verification of the development trajectory.

major comments (2)
  1. [Abstract] Abstract: The central performance claims that GLM-4 closely rivals or outperforms GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval, gets close to GPT-4-Turbo on IFEval, matches on long context, and outperforms on AlignBench are presented without any details on training data composition, decontamination, evaluation protocols, error bars, or released model weights. This blocks independent verification and is load-bearing for the paper's main empirical assertions.
  2. [Abstract] Abstract (GLM-4 All Tools): The claims regarding the GLM-4 All Tools model's performance in autonomously using tools like web browser and Python interpreter to match or surpass GPT-4 All Tools lack specific task definitions, quantitative metrics, or experimental setups, making these practical application results unverifiable.
minor comments (2)
  1. [Abstract] The phrase 'ten trillions of tokens' should be corrected to 'ten trillion tokens' for proper English usage.
  2. [Abstract] Standard benchmarks such as MMLU, GSM8K, etc., are mentioned without references; adding citations would improve clarity for readers unfamiliar with them.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on the abstract. We have revised the manuscript to improve verifiability by adding explicit references to detailed sections on data, evaluations, and tool-use experiments, while noting limitations on proprietary information.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central performance claims that GLM-4 closely rivals or outperforms GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval, gets close to GPT-4-Turbo on IFEval, matches on long context, and outperforms on AlignBench are presented without any details on training data composition, decontamination, evaluation protocols, error bars, or released model weights. This blocks independent verification and is load-bearing for the paper's main empirical assertions.

    Authors: We appreciate the referee highlighting the need for greater transparency. The abstract is a concise summary, but we have revised it to briefly note the 10-trillion-token pre-training scale and to direct readers to Section 3 for data composition and decontamination details, Section 5 for evaluation protocols (including error bars where reported), and the introduction for model release information. GLM-4-9B weights are publicly available on Hugging Face, supporting partial verification of the trajectory as described. Full proprietary training data composition for the closed GLM-4 model cannot be disclosed, consistent with industry practice for frontier models; we have clarified this distinction to aid readers. revision: partial

  2. Referee: [Abstract] Abstract (GLM-4 All Tools): The claims regarding the GLM-4 All Tools model's performance in autonomously using tools like web browser and Python interpreter to match or surpass GPT-4 All Tools lack specific task definitions, quantitative metrics, or experimental setups, making these practical application results unverifiable.

    Authors: We agree the abstract description was high-level and have revised it to specify example tasks (e.g., web-based information retrieval queries and Python-based math problem solving), along with success-rate metrics and direct comparisons to GPT-4 All Tools. We now explicitly reference the expanded experimental details, task definitions, and setups in the new Section 6 on tool-use alignment and evaluation, where autonomous decision-making and practical outcomes are quantified. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark reporting with no derivation chain

full rationale

The paper is an empirical report on training and evaluating the GLM-4 model family. It states pre-training corpus size, alignment process, and benchmark scores (MMLU, GSM8K, etc.) but contains no equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations that reduce claims to inputs by construction. All performance assertions rest on external benchmark comparisons rather than any internal mathematical reduction or ansatz smuggling. This is the standard case of a non-circular empirical model release paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The report relies on standard large-language-model training and evaluation practices without introducing new theoretical constructs, free parameters, or invented entities.

axioms (2)
  • domain assumption Standard transformer architecture and next-token prediction objective suffice for scaling to trillions of tokens
    Invoked implicitly by describing pre-training on ten trillions of tokens
  • domain assumption Multi-stage supervised fine-tuning plus human feedback produces reliable instruction following and tool-use behavior
    Stated as the alignment method for GLM-4

pith-pipeline@v0.9.0 · 5925 in / 1395 out tokens · 53197 ms · 2026-05-11T08:01:06.427306+00:00 · methodology


Forward citations

Cited by 50 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CHASM: Unveiling Covert Advertisements on Chinese Social Media

    cs.LG 2026-04 unverdicted novelty 8.0

    CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.

  2. Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

    cs.CR 2026-04 unverdicted novelty 8.0

    DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.

  3. PRISM: Planning and Reasoning with Intent in Simulated Embodied Environments

    cs.RO 2026-05 unverdicted novelty 7.0

    PRISM is a tiered benchmark with 300 human-verified tasks across five photorealistic apartments that diagnoses embodied agent failures in basic ability, reasoning ability, and long-horizon ability using an agent-agnostic API.

  4. K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs

    cs.CL 2026-05 conditional novelty 7.0

    K12-KGraph is a textbook-derived knowledge graph that powers a new benchmark revealing LLMs' poor curriculum cognition and a small training corpus that outperforms general instruction data on educational tasks.

  5. Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs

    cs.CL 2026-05 unverdicted novelty 7.0

    LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.

  6. VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

    cs.CL 2026-05 unverdicted novelty 7.0

    VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

  7. Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

    cs.CL 2026-05 unverdicted novelty 7.0

    TextPro-SLM minimizes the speech-text modality gap from the input side via a prosody-aware unified encoder, delivering the lowest gap and strong performance at 3B/7B scales with only ~1000 hours of audio.

  8. Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving

    cs.OS 2026-05 unverdicted novelty 7.0

    Tutti is a GPU-direct SSD-backed KV cache that removes CPU bottlenecks via object abstraction, GPU io_uring, and slack scheduling, delivering near-DRAM performance at 2x higher request rate and 27% lower cost than pri...

  9. OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice

    cs.CL 2026-05 unverdicted novelty 7.0

    OralMLLM-Bench is a new benchmark with 27 tasks in four cognitive categories that evaluates six MLLMs on dental radiographs and shows clear performance gaps versus clinicians.

  10. OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice

    cs.CL 2026-05 unverdicted novelty 7.0

    OralMLLM-Bench reveals performance gaps between multimodal large language models and clinicians on cognitive tasks for dental radiographic analysis across periapical, panoramic, and cephalometric images.

  11. FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios

    cs.CL 2026-05 unverdicted novelty 7.0

    FinSafetyBench shows that LLMs remain vulnerable to adversarial prompts that bypass financial compliance safeguards, with notably higher failure rates in Chinese-language scenarios.

  12. From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework

    cs.LG 2026-05 unverdicted novelty 7.0

    AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming pri...

  13. ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction

    cs.CV 2026-04 unverdicted novelty 7.0

    ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning...

  14. EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    EmoTrans is a new video benchmark with four progressive tasks that measures how well current multimodal LLMs handle dynamic emotion transitions rather than static recognition.

  15. Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts

    cs.CL 2026-04 unverdicted novelty 7.0

    Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.

  16. C-Mining: Unsupervised Discovery of Seeds for Cultural Data Synthesis via Geometric Misalignment

    cs.CL 2026-04 unverdicted novelty 7.0

    C-Mining automatically mines high-fidelity Culture Points from raw multilingual text by treating cross-lingual geometric isolation in embeddings as a quantifiable signal for cultural specificity, then uses them to syn...

  17. TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice

    cs.CL 2026-04 unverdicted novelty 7.0

    TaxPraBen is a new benchmark with 14 datasets and a structured evaluation method for measuring LLM performance on Chinese real-world tax tasks and scenarios.

  18. How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace

    cs.AI 2026-04 unverdicted novelty 7.0

    Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.

  19. Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

    cs.CL 2026-04 conditional novelty 7.0

    SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.

  20. SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

    cs.CR 2026-04 unverdicted novelty 7.0

    SkillTrojan demonstrates that backdoors can be placed in composable skills of agent systems to achieve up to 97% attack success rate with only minor loss in clean-task accuracy.

  21. Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation

    cs.CV 2026-04 unverdicted novelty 7.0

    Visual attention in MLLMs shows inertia that hinders cognitive inference on object relations, addressed by a training-free Inertia-aware Visual Excitation method that selects dynamically emerging tokens and applies an...

  22. When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    Layer-wise Laplacian energy of visual attention reveals hallucination emergence in MLLMs and enables LaSCD, a closed-form logit remapping strategy that mitigates hallucinations while preserving general performance.

  23. UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification

    cs.CL 2026-05 unverdicted novelty 6.0

    UniPrefill accelerates LLM prefill via block-wise dynamic sparsification, achieving up to 2.1x TTFT speedup while supporting hybrid architectures and native vLLM continuous batching.

  24. On the Role of Language Representations in Auto-Bidding: Findings and Implications

    cs.AI 2026-05 unverdicted novelty 6.0

    SemBid injects LLM-encoded Task, History, and Strategy semantics as tokens into offline bidding trajectories and uses self-attention to outperform numerical-only baselines in performance, constraint satisfaction, and ...

  25. CAR: Query-Guided Confidence-Aware Reranking for Retrieval-Augmented Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    CAR reranks documents in RAG by promoting those that increase generator confidence (via answer consistency sampling) and demoting those that decrease it, yielding NDCG@5 gains on BEIR datasets that correlate with F1 i...

  26. Theory-Grounded Evaluation Exposes the Authorship Gap in LLM Personalization

    cs.CL 2026-04 unverdicted novelty 6.0

    Theory-grounded authorship metrics show four LLM personalization methods score below calibrated baselines (0.484-0.508 vs. 0.626 floor), exposing a gap hidden by uncalibrated evaluations.

  27. Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding

    cs.AR 2026-04 unverdicted novelty 6.0

    Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.

  28. CAP: Controllable Alignment Prompting for Unlearning in LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    CAP optimizes prompts via reinforcement learning to selectively unlearn target knowledge in LLMs while preserving general capabilities, without any parameter updates and with reversible revocation.

  29. CAP: Controllable Alignment Prompting for Unlearning in LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    CAP enables reversible unlearning of targeted knowledge in LLMs through optimized prompts generated via reinforcement learning, without any parameter updates.

  30. LLM Safety From Within: Detecting Harmful Content with Internal Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.

  31. Multi-LLM Token Filtering and Routing for Sequential Recommendation

    cs.IR 2026-04 unverdicted novelty 6.0

    MLTFR combines user-guided token filtering with a multi-LLM mixture-of-experts and Fisher-weighted consensus expert to deliver stable gains in corpus-free sequential recommendation.

  32. MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs

    cs.AR 2026-04 unverdicted novelty 6.0

    MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.

  33. The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

    cs.CR 2026-04 unverdicted novelty 6.0

    Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.

  34. Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis

    cs.CV 2026-04 unverdicted novelty 6.0

    Transferring a 2D MLLM to 3D CT inputs via parameter reuse, a Text-Guided Hierarchical MoE framework, and two-stage training yields better performance than prior 3D medical MLLMs on medical report generation and visua...

  35. Aligned Agents, Biased Swarm: Measuring Bias Amplification in Multi-Agent Systems

    cs.MA 2026-04 unverdicted novelty 6.0

    Multi-agent systems amplify minor stochastic biases into systemic polarization via echo-chamber effects in structured workflows, even with neutral agents.

  36. The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

    cs.CR 2026-04 unverdicted novelty 6.0

    ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.

  37. In-Place Test-Time Training

    cs.LG 2026-04 conditional novelty 6.0

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  38. Cooking Up Risks: Benchmarking and Reducing Food Safety Risks in Large Language Models

    cs.CR 2026-04 conditional novelty 6.0

    A new benchmark exposes food-safety gaps in current LLMs and guardrails, and a fine-tuned 4B model is offered as a domain-specific fix.

  39. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    cs.CV 2025-07 unverdicted novelty 6.0

    GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

  40. When Emotion Becomes Trigger: Emotion-style dynamic Backdoor Attack Parasitising Large Language Models

    cs.CL 2026-05 unverdicted novelty 5.0

    Paraesthesia is an emotion-style dynamic backdoor attack achieving ~99% success rate on instruction and classification tasks across four LLMs while preserving clean performance.

  41. Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

    cs.CL 2026-05 unverdicted novelty 5.0

    TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.

  42. From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    AdaPlan-H enables LLM agents to generate self-adaptive hierarchical plans that adjust detail level to task difficulty, improving success rates in multi-step tasks.

  43. ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures

    cs.AI 2026-04 unverdicted novelty 5.0

    ReCAPA uses multi-level predictive correction and semantic alignment modules to reduce cascading failures in VLA systems, with new metrics for tracking error propagation and recovery on embodied benchmarks.

  44. ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures

    cs.AI 2026-04 unverdicted novelty 5.0

    ReCAPA adds predictive correction and multi-level semantic alignment to VLA models, plus two new metrics for tracking error spread and recovery, yielding competitive benchmark results over LLM baselines.

  45. SparseBalance: Load-Balanced Long Context Training with Dynamic Sparse Attention

    cs.LG 2026-04 unverdicted novelty 5.0

    SparseBalance dynamically adjusts sparsity and batches workloads to load-balance sparse attention training, delivering up to 1.33x speedup and 0.46% better long-context performance on LongBench.

  46. Disposition Distillation at Small Scale: A Three-Arc Negative Result

    cs.LG 2026-04 accept novelty 5.0

    Multiple standard techniques for instilling dispositions in small LMs consistently failed across five models, with initial apparent gains revealed as artifacts and cross-validation collapsing to chance.

  47. MAFIG: Multi-agent Driven Formal Instruction Generation Framework

    cs.AI 2026-04 unverdicted novelty 5.0

    MAFIG uses a Perception Agent and Emergency Decision Agent plus span-focused local distillation to let lightweight models rapidly generate formal instructions that fix local scheduling failures, achieving over 94% suc...

  48. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  49. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

    cs.CL 2024-03 unverdicted novelty 4.0

    LlamaFactory provides a unified no-code framework for efficient fine-tuning of 100+ LLMs via an integrated web UI and has been released on GitHub.

  50. XekRung Technical Report

    cs.CR 2026-04 unverdicted novelty 3.0

    XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 46 Pith papers · 15 internal anchors

  1. [1]

    Y . Bai, X. Lv, J. Zhang, Y . He, J. Qi, L. Hou, J. Tang, Y . Dong, and J. Li. Longalign: A recipe for long context alignment of large language models, 2024

  2. [2]

    Y . Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y . Dong, J. Tang, and J. Li. Longbench: A bilingual, multitask benchmark for long context understanding, 2023

  3. [3]

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amo...

  4. [4]

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Her...

  5. [5]

    S. Chen, S. Wong, L. Chen, and Y . Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023

  6. [6]

    PaLM: Scaling Language Modeling with Pathways

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  7. [7]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021. 13

  8. [8]

    T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems , 35:16344– 16359, 2022

  9. [9]

    M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang, and J. Tang. Cogview: Mastering text-to-image generation via transformers, 2021

  10. [10]

    M. Ding, W. Zheng, W. Hong, and J. Tang. Cogview2: Faster and better text-to-image generation via hierarchical transformers. Advances in Neural Information Processing Systems , 35:16890–16902, 2022

  11. [11]

    Z. Du, Y . Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 320–335, 2022

  12. [12]

    Z. Du, A. Zeng, Y . Dong, and J. Tang. Understanding emergent abilities of language models from the loss perspective, 2024

  13. [13]

    T. GLM. Chatglm-6b: An open bilingual dialogue language model. https://github.com/ THUDM/ChatGLM-6B, 2023

  14. [14]

    Hendrycks, C

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Mea- suring massive multitask language understanding. In International Conference on Learning Representations, 2021

  15. [15]

    Gaussian Error Linear Units (GELUs)

    D. Hendrycks and K. Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016

  16. [16]

    W. Hong, W. Wang, Q. Lv, J. Xu, W. Yu, J. Ji, Y . Wang, Z. Wang, Y . Zhang, J. Li, B. Xu, Y . Dong, M. Ding, and J. Tang. Cogagent: A visual language model for gui agents, 2023

  17. [17]

    Z. Hou, Y. Niu, Z. Du, X. Zhang, X. Liu, A. Zeng, Q. Zheng, M. Huang, H. Wang, J. Tang, and Y. Dong. Chatglm-rlhf: Practices of aligning large language models with human feedback, 2024.

  18. [18]

    H. Lai, X. Liu, I. L. Iong, S. Yao, Y. Chen, P. Shen, H. Yu, H. Zhang, X. Zhang, Y. Dong, et al. Autowebglm: Bootstrap and reinforce a large language model-based web navigating agent. arXiv preprint arXiv:2404.03648, 2024.

  19. [19]

    Y. Li, S. Bubeck, R. Eldan, A. D. Giorno, S. Gunasekar, and Y. T. Lee. Textbooks are all you need ii: phi-1.5 technical report, 2023.

  20. [20]

    P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. Cosgrove, C. D. Manning, C. Ré, D. Acosta-Navas, D. A. Hudson, E. Zelikman, E. Durmus, F. Ladhak, F. Rong, H. Ren, H. Yao, J. Wang, K. Santhanam, L. Orr, L. Zheng, M. Yuksekgonul, M. Suzgun, N. Kim, N. Guha...

  21. [21]

    M. Liu, A. Zeng, B. Wang, P. Zhang, J. Tang, and Y. Dong. Apar: Llms can do auto-parallel auto-regressive decoding. ArXiv, abs/2401.06761, 2024.

  22. [22]

    X. Liu, H. Lai, H. Yu, Y. Xu, A. Zeng, Z. Du, P. Zhang, Y. Dong, and J. Tang. Webglm: Towards an efficient web-enhanced question answering system with human preferences. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4549–4560, 2023.

  23. [23]

    X. Liu, X. Lei, S. Wang, Y. Huang, Z. Feng, B. Wen, J. Cheng, P. Ke, Y. Xu, W. L. Tam, X. Zhang, L. Sun, H. Wang, J. Zhang, M. Huang, Y. Dong, and J. Tang. Alignbench: Benchmarking chinese alignment of large language models, 2023.

  24. [24]

    X. Liu, X. Song, Y. Dong, and J. Tang. Extensive self-contrast enables feedback-free language model alignment, 2024.

  25. [25]

    X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang, Y. Su, H. Sun, M. Huang, Y. Dong, and J. Tang. Agentbench: Evaluating llms as agents, 2023.

  26. [26]

    Meta. Introducing meta llama 3: The most capable openly available llm to date. https://ai.meta.com/blog/meta-llama-3/, 2024.

  27. [27]

    OpenAI. tiktoken. https://github.com/openai/tiktoken, 2023.

  28. [28]

    OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  29. [29]

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

  30. [30]

    J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023.

  31. [31]

    O. Press, N. Smith, and M. Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations, 2022.

  32. [32]

    D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. CoRR, abs/2311.12022, 2023.

  33. [33]

    T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.

  34. [34]

    R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, 2016. Association for Computational Linguistics.

  35. [35]

    N. Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019.

  36. [36]

    N. Shazeer. Glu variants improve transformer, 2020.

  37. [37]

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, A. Kluska, A. Lewkowycz, A. Agarwal, A. Power, A. Ray, A. Warstadt, A. W. Kocurek, A. Safaya, A. Tazarv, A. Xiang, A. Parrish, A. Nie, A. Hussain, A. Askell, A. Dsouza, A. Rahane, A. S. Iyer, A. Andreassen, A. Santilli, A. Stuhlmülle...

  38. [38]

    J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.

  39. [39]

    M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, and J. Wei. Challenging big-bench tasks and whether chain-of-thought can solve them. In A. Rogers, J. L. Boyd-Graber, and N. Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 202...

  40. [40]

    G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, D. Silver, S. Petrov, M. Johnson, I. Antonoglou, J. Schrittwieser, A. Glaese, J. Chen, E. Pitler, T. Lillicrap, A. Lazaridou, O. Firat, J. Molloy, M. Isard, P. R. Barham, T. Hennigan, B. Lee, F. Viola, M. Reynolds, Y. Xu, R. Doherty, E...

  41. [41]

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. Llama: Open and efficient foundation language models, 2023.

  42. [42]

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Ko...

  43. [43]

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need, 2023.

  44. [44]

    H. Wang, S. Ma, L. Dong, S. Huang, D. Zhang, and F. Wei. Deepnet: Scaling transformers to 1,000 layers, 2022.

  45. [45]

    W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y. Wang, J. Ji, Z. Yang, L. Zhao, X. Song, J. Xu, B. Xu, J. Li, Y. Dong, M. Ding, and J. Tang. Cogvlm: Visual expert for pretrained language models, 2023.

  46. [46]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing System...

  47. [47]

    W. Xiong, J. Liu, I. Molybog, H. Zhang, P. Bhargava, R. Hou, L. Martin, R. Rungta, K. A. Sankararaman, B. Oguz, et al. Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039, 2023.

  48. [48]

    Y. Xu, X. Liu, X. Liu, Z. Hou, Y. Li, X. Zhang, Z. Wang, A. Zeng, Z. Du, W. Zhao, J. Tang, and Y. Dong. Chatglm-math: Improving math problem-solving in large language models with a self-critique pipeline, 2024.

  49. [49]

    F. Yan, H. Mao, C. C.-J. Ji, T. Zhang, S. G. Patil, I. Stoica, and J. E. Gonzalez. Berkeley function calling leaderboard, 2024.

  50. [50]

    S. Yang, W.-L. Chiang, L. Zheng, J. E. Gonzalez, and I. Stoica. Rethinking benchmark and contamination for language models with rephrased samples. arXiv preprint arXiv:2311.04850, 2023.

  51. [51]

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.

  52. [52]

    A. Zeng, M. Liu, R. Lu, B. Wang, X. Liu, Y. Dong, and J. Tang. Agenttuning: Enabling generalized agent abilities for llms, 2023.

  53. [53]

    A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.

  54. [54]

    S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

  55. [55]

    S. Zhang, H. Zhao, X. Liu, Q. Zheng, Z. Qi, X. Gu, X. Zhang, Y. Dong, and J. Tang. Naturalcodebench: Examining coding performance mismatch on humaneval and natural user prompts. arXiv preprint arXiv:2405.04520, 2024.

  56. [56]

    Z. Zhang, L. Lei, L. Wu, R. Sun, Y. Huang, C. Long, X. Liu, X. Lei, J. Tang, and M. Huang. Safetybench: Evaluating the safety of large language models with multiple choice questions. arXiv preprint arXiv:2309.07045, 2023.

  57. [57]

    W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.

  58. [58]

    Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, Z. Wang, L. Shen, A. Wang, Y. Li, T. Su, Z. Yang, and J. Tang. Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x, 2023.

  59. [59]

    W. Zheng, J. Teng, Z. Yang, W. Wang, J. Chen, X. Gu, Y. Dong, M. Ding, and J. Tang. Cogview3: Finer and faster text-to-image generation via relay diffusion, 2024.

  60. [60]

    C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh, M. Lewis, L. Zettlemoyer, and O. Levy. Lima: Less is more for alignment, 2023.

  61. [61]

    J. Zhou, Z. Chen, D. Wan, B. Wen, Y. Song, J. Yu, Y. Huang, L. Peng, J. Yang, X. Xiao, et al. Characterglm: Customizing chinese conversational ai characters with large language models. arXiv preprint arXiv:2311.16832, 2023.

  62. [62]

    J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.