GLM-5: from Vibe Coding to Agentic Engineering
Pith reviewed 2026-05-11 05:42 UTC · model grok-4.3
The pith
GLM-5 advances from vibe coding to agentic engineering by using asynchronous reinforcement learning to handle complex software tasks more effectively.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, it implements a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Novel asynchronous agent RL algorithms further improve RL quality, enabling the model to learn more effectively from complex, long-horizon interactions. Through these innovations, GLM-5 achieves state-of-the-art performance on major open benchmarks and demonstrates unprecedented capability in real-world coding tasks, surpassing previous baselines in handling end-to-end software engineering challenges.
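DSA is named but never defined in this excerpt. As orientation only, a minimal top-k sparse-attention sketch in the same general spirit (single query, single head, NumPy; the choice of `k` is a hypothetical parameter, and real trainable sparse attention uses a cheap learned indexer rather than the full score vector) might look like:

```python
import numpy as np

def sparse_attention(q, K, V, k=64):
    """Single-query attention restricted to the top-k keys by raw score.
    Illustrative sketch only: this computes all n logits and then keeps k,
    whereas a production kernel would select keys via a lightweight indexer."""
    n, d = K.shape
    k = min(k, n)
    scores = q @ K.T / np.sqrt(d)            # (n,) attention logits
    top = np.argpartition(scores, -k)[-k:]   # indices of the k largest logits
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                             # softmax over the selected keys only
    return w @ V[top]                        # weighted sum of the k values

rng = np.random.default_rng(0)
K, V = rng.normal(size=(1024, 32)), rng.normal(size=(1024, 32))
q = rng.normal(size=32)
out = sparse_attention(q, K, V, k=64)        # value mixing scales with k, not n
```

With `k` equal to the sequence length this reduces exactly to dense softmax attention, which is one way to sanity-check a kernel of this shape against a reference implementation.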
What carries the argument
Asynchronous reinforcement learning infrastructure that decouples generation from training, paired with DSA to cut costs while retaining long-context fidelity.
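The load-bearing mechanism can be caricatured in a few lines. This is an illustrative producer/consumer sketch, not the paper's infrastructure: the rollout contents, the bounded queue, and the version tagging are all assumptions made for the example.

```python
import queue
import threading

def run_async_rl(num_rollouts=32, max_in_flight=4):
    """Toy decoupling of generation from training: one thread produces
    rollouts while another consumes them for gradient updates. The bounded
    queue caps how far generation can run ahead (rollout staleness)."""
    buf = queue.Queue(maxsize=max_in_flight)
    policy_version = 0          # bumped by each "gradient update"
    trained = []

    def generator():
        for i in range(num_rollouts):
            # Tag each rollout with the policy version that produced it, so a
            # trainer could apply off-policy corrections for stale samples.
            buf.put({"rollout_id": i, "born_at": policy_version})
        buf.put(None)           # sentinel: no more rollouts

    def trainer():
        nonlocal policy_version
        while (item := buf.get()) is not None:
            item["staleness"] = policy_version - item["born_at"]
            trained.append(item)        # stand-in for an actual train step
            policy_version += 1         # policy advances while generation continues

    threads = [threading.Thread(target=generator), threading.Thread(target=trainer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return trained

history = run_async_rl()
```

In a synchronous setup the generator would block until each update finished; here both sides only synchronize through the buffer, which is the efficiency claim the review treats as load-bearing.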
If this is right
- Post-training of large models becomes more efficient without loss of long-context ability.
- Models learn more effectively from extended, complex coding interactions.
- Performance on end-to-end software engineering tasks exceeds prior baselines.
- Greater model autonomy supports more complete software development workflows.
Where Pith is reading between the lines
- The same decoupling technique could be tested in non-coding domains that require long-horizon planning.
- Deployment in open-source repositories would reveal whether benchmark gains translate to messy, real projects.
- Future models might combine this infrastructure with multi-agent setups to coordinate larger engineering efforts.
Load-bearing premise
The reported gains in coding performance and efficiency are produced by the asynchronous RL infrastructure and DSA rather than by undisclosed choices in data, scale, or evaluation.
What would settle it
Train a model at similar scale without the asynchronous RL components and compare its results on the same real-world coding benchmarks and end-to-end engineering tasks; equal or better performance would undermine the central claim.
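The settling experiment above reduces to a paired comparison on identical tasks. A sketch of how the resulting per-task outcomes could be compared, using entirely hypothetical numbers rather than anything reported for GLM-5:

```python
import random

def paired_bootstrap(full, ablated, n_boot=10_000, seed=0):
    """Paired bootstrap over per-task outcomes (1 = solved, 0 = failed).
    Returns the fraction of resamples in which the ablated model does at
    least as well as the full model on the same resampled task set."""
    assert len(full) == len(ablated)
    rng = random.Random(seed)
    n = len(full)
    ablated_wins = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample tasks with replacement
        if sum(ablated[i] for i in idx) >= sum(full[i] for i in idx):
            ablated_wins += 1
    return ablated_wins / n_boot

# Hypothetical outcomes on 100 shared coding tasks (not from the paper).
full    = [1] * 60 + [0] * 40   # 60% resolved with the async RL components
ablated = [1] * 45 + [0] * 55   # 45% resolved without them
p = paired_bootstrap(full, ablated)
```

A small `p` would say the gap is unlikely to be resampling noise; a large one, per the premise above, would undermine attributing the gains to the asynchronous RL infrastructure.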
read the original abstract
We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that further improve RL quality, enabling the model to learn from complex, long-horizon interactions more effectively. Through these innovations, GLM-5 achieves state-of-the-art performance on major open benchmarks. Most critically, GLM-5 demonstrates unprecedented capability in real-world coding tasks, surpassing previous baselines in handling end-to-end software engineering challenges. Code, models, and more information are available at https://github.com/zai-org/GLM-5.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce GLM-5, a foundation model transitioning from vibe coding to agentic engineering. It uses DSA to reduce costs while maintaining long-context fidelity, a new asynchronous RL infrastructure decoupling generation from training to improve efficiency, and novel asynchronous agent RL algorithms for better long-horizon learning. These lead to SOTA on open benchmarks and unprecedented real-world coding performance in end-to-end software engineering.
Significance. If the performance claims hold with proper substantiation, the work could have high significance for machine learning and AI agents by demonstrating scalable methods for agentic coding systems, with potential efficiency gains from the proposed RL decoupling and DSA that could impact practical deployment.
Major comments (3)
- [Abstract] The abstract asserts SOTA performance on major open benchmarks and unprecedented real-world coding capabilities but contains no benchmark numbers, ablation studies, error bars, or methodological details, providing no evidence that the data or methods support the central claims.
- [Methods] The asynchronous reinforcement learning infrastructure, DSA, and novel async agent RL algorithms are described as the primary drivers of efficiency and performance gains, but the manuscript provides no ablation studies, scaling curves, or controlled comparisons holding data, model size, and training compute fixed while varying only these components.
- [Results] No tables, figures, or quantitative results are presented to demonstrate the claimed SOTA benchmark performance or improvements in real-world end-to-end software engineering tasks, leaving the attribution of gains to the proposed innovations underdetermined.
Minor comments (1)
- [Abstract] The term 'vibe coding' is used without definition or reference, which may reduce accessibility for readers.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We agree that the current manuscript draft requires substantial additions to provide quantitative evidence, ablations, and results that substantiate the performance claims. We will revise accordingly to address all major comments. Point-by-point responses follow.
read point-by-point responses
-
Referee: [Abstract] The abstract asserts SOTA performance on major open benchmarks and unprecedented real-world coding capabilities but contains no benchmark numbers, ablation studies, error bars, or methodological details, providing no evidence that the data or methods support the central claims.
Authors: We acknowledge that the abstract as currently written lacks specific numbers and details. In the revised manuscript, we will expand the abstract to report key benchmark results (such as pass@1 scores on HumanEval, MBPP, and other standard coding benchmarks), quantitative improvements on end-to-end engineering tasks, and concise references to the core methodological contributions. This will immediately ground the claims in evidence. revision: yes
-
Referee: [Methods] The asynchronous reinforcement learning infrastructure, DSA, and novel async agent RL algorithms are described as the primary drivers of efficiency and performance gains, but the manuscript provides no ablation studies, scaling curves, or controlled comparisons holding data, model size, and training compute fixed while varying only these components.
Authors: We agree that rigorous ablations are necessary to isolate the contributions of the asynchronous RL infrastructure, DSA, and novel agent RL algorithms. The revision will include a new ablation subsection with controlled experiments that vary only these components while holding data, model size, and total compute constant. Scaling curves for efficiency and performance will also be added. revision: yes
-
Referee: [Results] No tables, figures, or quantitative results are presented to demonstrate the claimed SOTA benchmark performance or improvements in real-world end-to-end software engineering tasks, leaving the attribution of gains to the proposed innovations underdetermined.
Authors: We recognize the absence of quantitative results in the current draft. The revised manuscript will contain comprehensive results sections with tables comparing GLM-5 against prior models on open benchmarks, figures showing performance gains and efficiency improvements, and metrics for real-world end-to-end software engineering tasks. Error bars and statistical details will be reported to support attribution of gains. revision: yes
Circularity Check
No circularity detected in derivation chain
full rationale
The provided paper text consists of an abstract and high-level description of GLM-5's architectural features (DSA, asynchronous RL infrastructure, novel agent RL algorithms) and empirical claims of SOTA performance. No equations, derivations, predictions, or first-principles results are present. Consequently, none of the enumerated circularity patterns (self-definitional, fitted-input-called-prediction, self-citation load-bearing, etc.) can be exhibited because there is no derivation chain to inspect. Claims rest on reported benchmarks and real-world tasks rather than any internal reduction to inputs by construction. This is the expected outcome for a model-release paper lacking formal mathematical structure.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 60 Pith papers
-
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...
-
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
-
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
-
When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
-
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perf...
-
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents
AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.
-
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
StereoTales shows that LLMs produce harmful, culturally adapted stereotypes in open-ended multilingual stories, with patterns consistent across providers and aligned human-LLM harm judgments.
-
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.
-
Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness
LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.
-
Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning
OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.
-
KL for a KL: On-Policy Distillation with Control Variate Baseline
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
-
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
-
Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution
TEBench is a new project-level benchmark for test evolution showing coding agents achieve only 45-49% F1 on identifying tests needing changes, with stale tests hardest due to reliance on execution failures.
-
MolViBench: Evaluating LLMs on Molecular Vibe Coding
MolViBench is the first benchmark designed to evaluate LLMs on generating executable programs for molecular tasks in drug discovery.
-
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
-
MathDuels: Evaluating LLMs as Problem Posers and Solvers
Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.
-
Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs
MultiLogBench shows that LLM performance on automated logging varies substantially across programming languages, demonstrating that single-language evidence is insufficient for general claims about model behavior or t...
-
BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.
-
ClawBench: Can AI Agents Complete Everyday Online Tasks?
ClawBench is a benchmark of 153 live-web tasks where AI agents achieve low success rates, e.g. 33.3% for Claude Sonnet 4.6.
-
AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.
-
DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72
DWDP distributes MoE weights across GPUs for independent execution without collective synchronization, improving output TPS/GPU by 8.8 percent on GB200 NVL72 for DeepSeek-R1 under 8K input and 1K output lengths.
-
MinT: Managed Infrastructure for Training and Serving Millions of LLMs
MinT enables efficient management of million-scale LoRA-adapted LLM policies over shared 1T-parameter base models by moving only small adapters through training and serving pipelines.
-
Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation
Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.
-
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle
SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.
-
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
-
Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction
Missing old logits in async agentic RL entangle discrepancy and staleness terms in PPO off-policy correction; exact acquisition methods and revised PPO-EWMA restore decoupled updates with reported gains in speed and p...
-
ProteinOPD: Towards Effective and Efficient Preference Alignment for Protein Design
ProteinOPD uses token-level on-policy distillation from multiple preference-specific teacher models into a shared student to balance competing objectives in protein design, delivering gains on targets without losing d...
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow-matching text-to-image models, lifting GenEval from 63 to 92 and OCR accuracy from 59 to 94 while preserving fidelity.
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow matching models, achieving GenEval of 92 and OCR accuracy of 94 on Stable Diffusion 3.5 Medium while avoiding the seesaw effect of multi-reward optimization.
-
LoopTrap: Termination Poisoning Attacks on LLM Agents
LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.
-
Evaluation Awareness in Language Models Has Limited Effect on Behaviour
Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.
-
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.
-
Affordance Agent Harness: Verification-Gated Skill Orchestration
Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...
-
Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding
EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.
-
Co-Evolving Policy Distillation
CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...
-
MAIC-UI: Making Interactive Courseware with Generative UI
MAIC-UI provides a zero-code authoring system for generating and iteratively editing interactive courseware from educational materials via structured analysis and incremental generation, with lab and classroom evaluat...
-
QuantClaw: Precision Where It Matters for OpenClaw
QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.
-
Temporally Extended Mixture-of-Experts Models
Temporally extended MoE layers using the option-critic framework with deliberation costs cut switching rates below 5% while retaining most capability on MATH, MMLU, and MMMLU.
-
AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards
AeSlides is a GRPO-based RL framework that uses verifiable aesthetic metrics to optimize LLM slide generation, achieving large gains in layout quality metrics and human scores with only 5K prompts.
-
ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...
-
Toward Autonomous Long-Horizon Engineering for ML Research
AiScientist improves ML research benchmarks by 10.54 points on PaperBench and reaches 81.82% Any Medal on MLE-Bench Lite through hierarchical control plus durable file-based state instead of conversational handoffs.
-
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.
-
ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
ClawGuard enforces user-derived access constraints at tool-call boundaries to block indirect prompt injection in tool-augmented LLM agents across web, MCP, and skill injection channels.
-
ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
ClawGuard enforces deterministic, user-derived access constraints at tool boundaries to block indirect prompt injection without changing the underlying LLM.
-
Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents
Claw-Eval is a new trajectory-aware benchmark for LLM agents that records execution traces, audit logs, and environment snapshots to evaluate completion, safety, and robustness across 300 tasks, revealing that opaque ...
-
InCoder-32B-Thinking: Industrial Code World Model for Thinking
InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.
-
An Independent Safety Evaluation of Kimi K2.5
Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
-
From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents
A two-stage SFT pipeline distills execution-free then execution-based trajectories from a 480B model into smaller Qwen2.5-Coder agents, yielding 62.2% resolution on SWE-bench Verified and 44.1% zero-shot on the multil...
-
HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention
HISA speeds up fine-grained sparse attention indexers via block-then-token hierarchy, delivering substantial speedups at 64K context with no training and quality matching the original DSA on long-context benchmarks.
-
On-Policy Distillation with Best-of-N Teacher Rollout Selection
BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.
-
On-Policy Distillation with Best-of-N Teacher Rollout Selection
BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.
-
UserGPT Technical Report
UserGPT introduces a generative LLM framework with a behavior simulation engine, semantization module, and DF-GRPO post-training that scores 0.7325 on tag prediction and 0.7528 on summary generation on HPR-Bench while...
-
Learning CLI Agents with Structured Action Credit under Selective Observation
CLI agents trained with RL benefit from selective observation via σ-Reveal and structured credit assignment via A³ that leverages AST action sub-chains and trajectory margins.
-
PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents
PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing...
-
Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving
Irminsul recovers up to 83% of prompt tokens above exact-prefix matching and delivers 63% prefill energy savings per cache hit on MLA-MoE models by content-hashing CDC chunks and applying closed-form kr correction.
-
GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
GLM-5V-Turbo integrates multimodal perception as a core part of reasoning and execution for agentic tasks, reporting strong results in visual tool use and multimodal coding while keeping text-only performance competitive.
-
Reasoning Primitives in Hybrid and Non-Hybrid LLMs
Reasoning augmentation extends the difficulty range for both architectures, but hybrid models stay robust longer than transformers as sequential dependence increases in state-based recall tasks.
-
MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models
MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.
-
SparseBalance: Load-Balanced Long Context Training with Dynamic Sparse Attention
SparseBalance dynamically adjusts sparsity and batches workloads to load-balance sparse attention training, delivering up to 1.33x speedup and 0.46% better long-context performance on LongBench.
-
Agentic Insight Generation in VSM Simulations
A two-step agentic system for extracting insights from VSM simulations achieves up to 86% accuracy with top LLMs by using progressive data discovery and slim context.
Reference graph
Works this paper leans on
- [1]
- [2] S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman. QuaRot: Outlier-free 4-bit inference in rotated LLMs, 2024.
- [3] A. Backlund and L. Petersson. Vending-Bench: A benchmark for long-term coherence of autonomous agents. arXiv preprint arXiv:2502.15840, 2025.
- [4] I. Badertdinov, A. Golubev, M. Nekrashevich, A. Shevtsov, S. Karasik, A. Andriushchenko, M. Trofimova, D. Litvintseva, and B. Yangel. SWE-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents. arXiv preprint arXiv:2505.20411, 2025.
- [5] Y. Bai, S. Tu, J. Zhang, H. Peng, X. Wang, X. Lv, S. Cao, J. Xu, L. Hou, Y. Dong, J. Tang, and J. Li. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In ACL'25, pages 3639–3664, 2025.
- [6] C. Bandi, B. Hertzberg, G. Boo, T. Polakam, J. Da, S. Hassaan, M. Sharma, A. Park, E. Hernandez, D. Rambado, et al. MCP-Atlas: A large-scale benchmark for tool-use competency with real MCP servers. arXiv preprint arXiv:2602.00933, 2026.
- [7] V. Barres, H. Dong, S. Ray, X. Si, and K. Narasimhan. τ²-bench: Evaluating conversational agents in a dual-control environment. arXiv preprint arXiv:2506.07982, 2025.
- [8]
- [9] DeepSeek-AI, A. Liu, A. Mei, et al. DeepSeek-V3.2: Pushing the frontier of open large language models, 2025.
- [10]
- [11] C. Gao, X. Wu, Z. Lin, D. Zhang, and S. Hu. NExtLong: Toward effective long-context training without long documents, 2025.
- [12]
- [13] F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve. Better & faster large language models via multi-token prediction. arXiv preprint arXiv:2404.19737, 2024.
- [14] Y. Gu, L. Dong, F. Wei, and M. Huang. MiniLLM: Knowledge distillation of large language models. In ICLR'24, 2024.
- [15]
- [16] Y. He, S. Li, J. Liu, Y. Tan, W. Wang, H. Huang, X. Bu, H. Guo, C. Hu, B. Zheng, Z. Lin, X. Liu, D. Sun, S. Lin, Z. Zheng, X. Zhu, W. Su, and B. Zheng. Chinese SimpleQA: A Chinese factuality evaluation for large language models, 2024.
- [17]
- [18] J. Jia, Z. Chen, X. Wu, C. Gao, Z. Lin, D. Zhang, S. Hu, and B. Guo. EntropyLong: Effective long-context training via predictive uncertainty, 2025.
- [19] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.
- [20] Y. Leviathan, M. Kalman, and Y. Matias. Fast inference from transformers via speculative decoding. In ICML'23, pages 19274–19286, 2023.
- [21] J. Li, A. Fang, G. Smyrnis, M. Ivgi, et al. DataComp-LM: In search of the next generation of training sets for language models, 2025.
- [22]
- [23]
- [24] A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo, et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024.
- [25] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
- [26] A. Liu, A. Mei, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, B. Zhang, C. Lin, C. Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.
- [27] J. Liu, J. Le Tian, V. Daita, Y. Wei, Y. Ding, Y. K. Wang, J. Yang, and L. Zhang. RepoQA: Evaluating long context code understanding. In First Workshop on Long-Context Foundation Models @ ICML, 2024.
- [28]
- [29]
- [30] I. Moshkov, D. Hanley, I. Sorokin, S. Toshniwal, C. Henkel, B. Schifferer, W. Du, and I. Gitman. AIMO-2 winning solution: Building state-of-the-art mathematical reasoning models with OpenMathReasoning dataset. arXiv preprint arXiv:2504.16891, 2025.
- [31] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. A. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phanishayee, and M. Zaharia. Efficient large-scale language model training on GPU clusters using Megatron-LM, 2021.
- [32]
- [33] T. Patwardhan, R. Dias, E. Proehl, G. Kim, M. Wang, O. Watkins, S. P. Fishman, M. Aljubeh, P. Thacker, L. Fauconnet, et al. GDPval: Evaluating AI model performance on real-world economically valuable tasks. arXiv preprint arXiv:2510.04374, 2025.
- [34] L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, C. B. C. Zhang, M. Shaaban, J. Ling, S. Shi, et al. Humanity's Last Exam. arXiv preprint arXiv:2501.14249, 2025.
- [35] Prime Intellect. Synthetic-2 release: Four million collaboratively generated reasoning traces,
- [36] V. Pyatkin, S. Malik, V. Graf, H. Ivison, S. Huang, P. Dasigi, N. Lambert, and H. Hajishirzi. Generalizing verifiable instruction following, 2025.
- [37]
- [38] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. ZeRO: Memory optimizations toward training trillion parameter models, 2020.
- [39] D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In CoLM'24, 2024.
- [40] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [41] V. Sirdeshmukh, K. Deshpande, J. Mols, L. Jin, E.-Y. Cardona, D. Lee, J. Kritz, W. Primack, S. Yue, and C. Xing. MultiChallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier LLMs, 2025.
- [42] H. F. Team. Harbor: A framework for evaluating and optimizing agents and models in container environments, 2026.
-
[43]
K. Team, T. Bai, Y . Bai, Y . Bao, S. Cai, Y . Cao, Y . Charles, H. Che, C. Chen, G. Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
- [44]
-
[45]
T. T.-B. Team. Terminal-bench: A benchmark for ai agents in terminal environments, Apr 2025
work page 2025
-
[46]
Y . Tian, C. Wang, Z. Liu, H. Huang, W. Yu, D. Song, J. Tang, and Y . Guo. Beyond literal mapping: Benchmarking and improving non-literal translation evaluation, 2026
work page 2026
-
[47]
Y . Wang, S. Wang, S. Zhu, F. Fu, X. Liu, X. Xiao, H. Li, J. Li, F. Wu, and B. Cui. Flexsp: Accelerating large language model training via flexible sequence parallelism. InASPLOS’25, pages 421–436, 2025
work page 2025
[48]
[49] J. Wei, N. Karina, H. W. Chung, Y. J. Jiao, S. Papay, A. Glaese, J. Schulman, and W. Fedus. Measuring short-form factuality in large language models, 2024.
[50] J. Wei, Z. Sun, S. Papay, S. McKinney, J. Han, I. Fulford, H. W. Chung, A. T. Passos, W. Fedus, and A. Glaese. BrowseComp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516, 2025.
[51] L.-C. Xiaomi. MiMo-V2-Flash technical report, 2026.
[52] A. Yang, A. Li, B. Yang, B. Zhang, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[53]
[54] S. Yang, J. Kautz, and A. Hatamizadeh. Gated delta networks: Improving Mamba2 with delta rule. In ICLR’24, 2024.
[55] S. Yao, N. Shinn, P. Razavi, and K. Narasimhan. tau-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024.
[56]
[57] Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
[58] T. Yuan, Y. Liu, X. Ye, S. Zhang, J. Tan, B. Chen, C. Song, and D. Zhang. Accelerating the training of large language models using efficient activation rematerialization and optimal hybrid parallelism. In USENIX ATC’24, pages 545–561, 2024.
[59] L. Zhang, S. He, C. Zhang, Y. Kang, B. Li, C. Xie, J. Wang, M. Wang, Y. Huang, S. Fu, E. Nallipogu, Q. Lin, Y. Dang, S. Rajmohan, and D. Zhang. SWE-bench goes live! arXiv preprint arXiv:2505.23419, 2025.
[60] C. Zhao, C. Deng, C. Ruan, D. Dai, H. Gao, J. Li, L. Zhang, P. Huang, S. Zhou, S. Ma, et al. Insights into DeepSeek-V3: Scaling challenges and reflections on hardware for AI architectures. In ISCA’25, pages 1731–1745, 2025.
[61] X. Zhao, Y. Liu, K. Xu, J. Guo, Z. Wang, Y. Sun, X. Kong, Q. Cao, L. Jiang, Z. Wen, Z. Zhang, and J. Zhou. Small leak can sink a great ship: Boost RL training on MoE with IcePop!, Sep 2025.
[62] C. Zheng, S. Liu, M. Li, X.-H. Chen, B. Yu, C. Gao, K. Dang, Y. Liu, R. Men, A. Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025.
[63] P. Zhou, B. Leon, X. Ying, C. Zhang, Y. Shao, Q. Ye, D. Chong, Z. Jin, C. Xie, M. Cao, et al. BrowseComp-ZH: Benchmarking web browsing ability of large language models in Chinese. arXiv preprint arXiv:2504.19314, 2025.

A Hyper-Parameters

Hyper-parameters related to the model architecture of GLM-5 are shown in Table 10. For training, we follow the sett...
If the agent asks for information NOT in the instruction:
- Say you don’t remember or don’t have it
- Offer alternative information that IS mentioned in the instruction

Examples:
- If asked for order ID (not in instruction): "Sorry, I don’t remember the order ID, can you search for it? My name/email/phone number/zipcode is ..."
- If asked for email (not in instruction): "I don’t have my email handy, but I can give you my name and zip code, which are ..."
- Do not repeat the exact instruction in the conversation. Instead, u...