pith · machine review for the scientific record

arxiv: 2602.15763 · v2 · submitted 2026-02-17 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links

GLM-5: from Vibe Coding to Agentic Engineering

GLM-5-Team: Aohan Zeng, Bin Chen, Bin Xu, Bojie Wang, Bosi Wen, Can Huang, Changpeng Cai, Chao Yu, Chendi Ge, Chenghua Huang, Chengwei Hu, Chengxing Xie, Chenhui Zhang, Chen Li, Chenzheng Zhu, Congfeng Yin, Cunxiang Wang, Dan Zhang, Daoyan Lin, Da Yin, Dayong Yang, Ding Ai, Di Wang, Erle Zhu, Fangzhou Yi, Feiyu Chen, Gengzheng Pan, Guohong Wen, Hailong Sun, Haisha Zhao, Haiyi Hu, Hanchen Zhang, Hanrui Liu, Hanyu Zhang, Haobo Zhang, Haoke Zhang, Hao Peng, Haoran Wang, Hao Tai, Hao Zeng, He Liu, Hongning Wang, Hongwei Wang, Hongxi Yan, Hongyu Ge, Huan Liu, Huanpeng Chu, Huilong Chen, Jiachen Wang, Jiajie Zhang, Jiajing Zhao, Jiamin Ren, Jia'ni Zhao, Jian Jiao, Jiapeng Wang, Jiaqi Guo, Jiaxin Zhang, Jiayi Gui, Jiayue Zhao, Jie Tang, Jijie Li, Jing An, Jing Li, Jingsen Wang, Jingwei Yuan, Jingzhao Du, Jinhua Du, Jinxin Liu, Jinzhu Wu, Juanzi Li, Junkai Zhi, Junwen Duan, Kaiyue Zhou, Kangjian Wei, Kedong Wang, Ke Wang, Keyun Luo, Laiqiang Zhang, Leigang Sha, Lei Li, Liang Xu, Lindong Wu, Lin Fan, Lintao Ding, Lucen Zhong, Lu Chen, Mingdao Liu, Minghao Li, Mingming Zhao, Minlie Huang, Nianyi Lin, Pan Ta, Pengfan Du, Qian Dong, Qiang Zou, Qinkai Zheng, Rongjun Song, Rui Lu, Ruiqi Yang, Shangqing Tu, Shangtong Yang, Shaoxiang Wu, Shengyan Zhang, Shijie Li, Shuang Li, Shuang-Li, Shulin Cao, Shuyi Fan, Song Liu, Ting Jiang, Weining Zhang, Wei Qin, Wei Tian, Wenbo Yu, Wenjie Liang, Xiang Kuang, Xiangmeng Cheng, Xiangyang Li, Xiaodong Chen, Xiaohan Zhang, Xiaoquan Yan, Xiaowei Hu, Xiaoying Ling, Xing Fan, Xingye Xia, Xin Lv, Xinyuan Zhang, Xinze Zhang, Xirui Pan, Xuancheng Huang, Xuezhen Dong, Xunkai Zhang, Xu Zou, Yabo Xu, Yadi Liu, Yandong Wu, Yanfu Li, Yao Wei, Yidong Wang, Yifan An, Yifan Zhu, Yijun Tan, Yilin Niu, Yilin Zhou, Yiming Pan, Ying Zhang, Yinpei Su, Yipeng Geng, Yitong Zhu, Yonglin Tan, Yong Yan, Yuanhao Wen, Yuean Bi, Yuhan Shen, Yuhao Yang, Yujiang Li, Yukuo Cen, Yunan Liu, Yunqing Wang, Yuntao Li, Yurong Wu, Yushi Bai, Yutao Zhang, Yuxiao Dong, Yuxi Duan, Yuxuan Zhang, Zezhen Liu, Zhengtao Jiang, Zhengxiao Du, Zhenhe Yan, Zhenyu Hou, Zheyu Zhang, Zhixiang Wei, Zhongpei Qiao, Zhuo Chen, Zhuoer Feng, Zihan Wang, Zijun Yao, Zikang Wang, Zilin Zhu, Ziqiang Liu, Ziwei Chai, Zixuan Li, Ziyuan Wang, Zuzhou Zhang


Pith reviewed 2026-05-11 05:42 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords GLM-5 · agentic engineering · asynchronous reinforcement learning · coding models · foundation models · software engineering · RL infrastructure

The pith

GLM-5 advances from vibe coding to agentic engineering by using asynchronous reinforcement learning to handle complex software tasks more effectively.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GLM-5 as a foundation model designed to transition from vibe coding to agentic engineering. It adopts DSA to reduce training and inference costs while preserving long-context fidelity. A new asynchronous reinforcement learning infrastructure decouples generation from training to raise post-training efficiency. Novel asynchronous agent RL algorithms are added to improve learning from long-horizon interactions. These steps produce state-of-the-art results on open benchmarks and strong gains on real-world end-to-end software engineering challenges.

Core claim

GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, it implements a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Novel asynchronous agent RL algorithms further improve RL quality, enabling the model to learn from complex, long-horizon interactions more effectively. Through these innovations, GLM-5 achieves state-of-the-art performance on major open benchmarks and demonstrates unprecedented capability in real-world coding tasks, surpassing previous baselines in handling end-to-end software engineering challenges.

What carries the argument

Asynchronous reinforcement learning infrastructure that decouples generation from training, paired with DSA to cut costs while retaining long-context fidelity.
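The decoupling idea can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: rollout workers stream trajectories into a queue while a separate trainer consumes them in batches, so slow long-horizon generation never blocks updates. All names and the toy reward logic are invented for the sketch.

```python
import queue
import random
import threading

# Bounded queue hands trajectories from generation to training.
rollout_queue: "queue.Queue[dict]" = queue.Queue(maxsize=64)
NUM_ROLLOUTS = 32

def rollout_worker(worker_id: int, n: int) -> None:
    """Simulate an agent producing long-horizon trajectories asynchronously."""
    for _ in range(n):
        trajectory = {
            "worker": worker_id,
            "tokens": [random.randint(0, 9) for _ in range(8)],
            "reward": random.random(),
        }
        rollout_queue.put(trajectory)  # hand off to the trainer

def trainer(total: int, batch_size: int = 4) -> list:
    """Consume trajectories in batches; here 'training' is just averaging reward."""
    batch_rewards = []
    consumed = 0
    while consumed < total:
        batch = [rollout_queue.get() for _ in range(batch_size)]
        consumed += batch_size
        batch_rewards.append(sum(t["reward"] for t in batch) / batch_size)
    return batch_rewards

workers = [threading.Thread(target=rollout_worker, args=(i, 16)) for i in range(2)]
for w in workers:
    w.start()
rewards = trainer(total=NUM_ROLLOUTS)
for w in workers:
    w.join()
print(len(rewards))  # 32 trajectories consumed in 8 batches of 4
```

The point of the decoupling is visible even in the toy: the trainer's consumption rate and the workers' generation rate are independent, coupled only through the queue's backpressure.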

If this is right

  • Post-training of large models becomes more efficient without loss of long-context ability.
  • Models learn more effectively from extended, complex coding interactions.
  • Performance on end-to-end software engineering tasks exceeds prior baselines.
  • Greater model autonomy supports more complete software development workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decoupling technique could be tested in non-coding domains that require long-horizon planning.
  • Deployment in open-source repositories would reveal whether benchmark gains translate to messy, real projects.
  • Future models might combine this infrastructure with multi-agent setups to coordinate larger engineering efforts.

Load-bearing premise

The reported gains in coding performance and efficiency are produced by the asynchronous RL infrastructure and DSA rather than by undisclosed choices in data, scale, or evaluation.

What would settle it

Train a model at similar scale without the asynchronous RL components and compare its results on the same real-world coding benchmarks and end-to-end engineering tasks; equal or better performance would undermine the central claim.
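The proposed test is a paired comparison on fixed benchmarks. A minimal harness for that comparison might look like the sketch below; `evaluate`, the model callables, and the benchmark contents are all stand-ins invented for illustration.

```python
from statistics import mean

def evaluate(model, tasks):
    """Score a model (a callable task -> bool success) over benchmark tasks."""
    return mean(1.0 if model(task) else 0.0 for task in tasks)

def ablation_compare(full_model, ablated_model, benchmarks):
    """Return per-benchmark success-rate deltas (full minus ablated).

    A delta near zero on the same tasks is the outcome that would
    undermine the causal claim for the removed component.
    """
    return {
        name: evaluate(full_model, tasks) - evaluate(ablated_model, tasks)
        for name, tasks in benchmarks.items()
    }

# Dummy stand-ins: the 'full' model solves even-numbered tasks,
# the 'ablated' one solves only multiples of four.
benchmarks = {"swe_tasks": list(range(8))}
full = lambda t: t % 2 == 0
ablated = lambda t: t % 4 == 0
deltas = ablation_compare(full, ablated, benchmarks)
print(deltas)  # {'swe_tasks': 0.25}
```

The harness makes the falsification criterion concrete: the ablated run must hold data, model size, and compute fixed, so that any nonzero delta is attributable to the removed asynchronous RL components.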

read the original abstract

We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that further improve RL quality, enabling the model to learn from complex, long-horizon interactions more effectively. Through these innovations, GLM-5 achieves state-of-the-art performance on major open benchmarks. Most critically, GLM-5 demonstrates unprecedented capability in real-world coding tasks, surpassing previous baselines in handling end-to-end software engineering challenges. Code, models, and more information are available at https://github.com/zai-org/GLM-5.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims to introduce GLM-5, a foundation model transitioning from vibe coding to agentic engineering. It uses DSA to reduce costs while maintaining long-context fidelity, a new asynchronous RL infrastructure decoupling generation from training to improve efficiency, and novel asynchronous agent RL algorithms for better long-horizon learning. These lead to SOTA on open benchmarks and unprecedented real-world coding performance in end-to-end software engineering.

Significance. If the performance claims hold with proper substantiation, the work could have high significance for machine learning and AI agents by demonstrating scalable methods for agentic coding systems, with potential efficiency gains from the proposed RL decoupling and DSA that could impact practical deployment.

major comments (3)
  1. [Abstract] The abstract asserts SOTA performance on major open benchmarks and unprecedented real-world coding capabilities but contains no benchmark numbers, ablation studies, error bars, or methodological details, providing no evidence that the data or methods support the central claims.
  2. [Methods] The asynchronous reinforcement learning infrastructure, DSA, and novel async agent RL algorithms are described as the primary drivers of efficiency and performance gains, but the manuscript provides no ablation studies, scaling curves, or controlled comparisons holding data, model size, and training compute fixed while varying only these components.
  3. [Results] No tables, figures, or quantitative results are presented to demonstrate the claimed SOTA benchmark performance or improvements in real-world end-to-end software engineering tasks, leaving the attribution of gains to the proposed innovations underdetermined.
minor comments (1)
  1. [Abstract] The term 'vibe coding' is used without definition or reference, which may reduce accessibility for readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that the current manuscript draft requires substantial additions to provide quantitative evidence, ablations, and results that substantiate the performance claims. We will revise accordingly to address all major comments. Point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract] The abstract asserts SOTA performance on major open benchmarks and unprecedented real-world coding capabilities but contains no benchmark numbers, ablation studies, error bars, or methodological details, providing no evidence that the data or methods support the central claims.

    Authors: We acknowledge that the abstract as currently written lacks specific numbers and details. In the revised manuscript, we will expand the abstract to report key benchmark results (such as pass@1 scores on HumanEval, MBPP, and other standard coding benchmarks), quantitative improvements on end-to-end engineering tasks, and concise references to the core methodological contributions. This will immediately ground the claims in evidence. revision: yes

  2. Referee: [Methods] The asynchronous reinforcement learning infrastructure, DSA, and novel async agent RL algorithms are described as the primary drivers of efficiency and performance gains, but the manuscript provides no ablation studies, scaling curves, or controlled comparisons holding data, model size, and training compute fixed while varying only these components.

    Authors: We agree that rigorous ablations are necessary to isolate the contributions of the asynchronous RL infrastructure, DSA, and novel agent RL algorithms. The revision will include a new ablation subsection with controlled experiments that vary only these components while holding data, model size, and total compute constant. Scaling curves for efficiency and performance will also be added. revision: yes

  3. Referee: [Results] No tables, figures, or quantitative results are presented to demonstrate the claimed SOTA benchmark performance or improvements in real-world end-to-end software engineering tasks, leaving the attribution of gains to the proposed innovations underdetermined.

    Authors: We recognize the absence of quantitative results in the current draft. The revised manuscript will contain comprehensive results sections with tables comparing GLM-5 against prior models on open benchmarks, figures showing performance gains and efficiency improvements, and metrics for real-world end-to-end software engineering tasks. Error bars and statistical details will be reported to support attribution of gains. revision: yes

Circularity Check

0 steps flagged

No circularity detected in derivation chain

full rationale

The provided paper text consists of an abstract and high-level description of GLM-5's architectural features (DSA, asynchronous RL infrastructure, novel agent RL algorithms) and empirical claims of SOTA performance. No equations, derivations, predictions, or first-principles results are present. Consequently, none of the enumerated circularity patterns (self-definitional, fitted-input-called-prediction, self-citation load-bearing, etc.) can be exhibited because there is no derivation chain to inspect. Claims rest on reported benchmarks and real-world tasks rather than any internal reduction to inputs by construction. This is the expected outcome for a model-release paper lacking formal mathematical structure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no technical equations, training details, or derivations, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.0 · 6192 in / 1136 out tokens · 43598 ms · 2026-05-11T05:42:53.299362+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

    cs.CL 2026-05 accept novelty 8.0

    CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...

  2. WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

    cs.CL 2026-05 unverdicted novelty 8.0

    A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

  3. Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

    cs.AI 2026-05 unverdicted novelty 8.0

    Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

  4. When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

    cs.LG 2026-05 unverdicted novelty 8.0

    SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

  5. OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

    cs.CL 2026-04 unverdicted novelty 8.0

    OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perf...

  6. AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents

    cs.LG 2026-05 unverdicted novelty 7.0

    AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.

  7. StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs

    cs.CY 2026-05 unverdicted novelty 7.0

    StereoTales shows that LLMs produce harmful, culturally adapted stereotypes in open-ended multilingual stories, with patterns consistent across providers and aligned human-LLM harm judgments.

  8. StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs

    cs.CY 2026-05 accept novelty 7.0

    StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.

  9. Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

    cs.CL 2026-05 unverdicted novelty 7.0

    LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.

  10. Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning

    cs.CL 2026-05 unverdicted novelty 7.0

    OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.

  11. KL for a KL: On-Policy Distillation with Control Variate Baseline

    cs.LG 2026-05 unverdicted novelty 7.0

    vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...

  12. MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

    cs.LG 2026-05 conditional novelty 7.0

    MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.

  13. Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution

    cs.SE 2026-05 unverdicted novelty 7.0

    TEBench is a new project-level benchmark for test evolution showing coding agents achieve only 45-49% F1 on identifying tests needing changes, with stale tests hardest due to reliance on execution failures.

  14. MolViBench: Evaluating LLMs on Molecular Vibe Coding

    cs.CL 2026-05 unverdicted novelty 7.0

    MolViBench is the first benchmark designed to evaluate LLMs on generating executable programs for molecular tasks in drug discovery.

  15. When Alignment Isn't Enough: Response-Path Attacks on LLM Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.

  16. MathDuels: Evaluating LLMs as Problem Posers and Solvers

    cs.CL 2026-04 unverdicted novelty 7.0

    Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.

  17. Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs

    cs.SE 2026-04 unverdicted novelty 7.0

    MultiLogBench shows that LLM performance on automated logging varies substantially across programming languages, demonstrating that single-language evidence is insufficient for general claims about model behavior or t...

  18. BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

    cs.CV 2026-04 unverdicted novelty 7.0

    BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.

  19. ClawBench: Can AI Agents Complete Everyday Online Tasks?

    cs.CL 2026-04 unverdicted novelty 7.0

    ClawBench is a benchmark of 153 live-web tasks where AI agents achieve low success rates, e.g. 33.3% for Claude Sonnet 4.6.

  20. AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.

  21. DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72

    cs.DC 2026-04 unverdicted novelty 7.0

    DWDP distributes MoE weights across GPUs for independent execution without collective synchronization, improving output TPS/GPU by 8.8 percent on GB200 NVL72 for DeepSeek-R1 under 8K input and 1K output lengths.

  22. MinT: Managed Infrastructure for Training and Serving Millions of LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    MinT enables efficient management of million-scale LoRA-adapted LLM policies over shared 1T-parameter base models by moving only small adapters through training and serving pipelines.

  23. Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.

  24. SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle

    cs.SE 2026-05 unverdicted novelty 6.0

    SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.

  25. ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

  26. Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

    cs.LG 2026-05 unverdicted novelty 6.0

    Missing old logits in async agentic RL entangle discrepancy and staleness terms in PPO off-policy correction; exact acquisition methods and revised PPO-EWMA restore decoupled updates with reported gains in speed and p...

  27. ProteinOPD: Towards Effective and Efficient Preference Alignment for Protein Design

    cs.LG 2026-05 unverdicted novelty 6.0

    ProteinOPD uses token-level on-policy distillation from multiple preference-specific teacher models into a shared student to balance competing objectives in protein design, delivering gains on targets without losing d...

  28. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Flow-OPD applies on-policy distillation to flow-matching text-to-image models, lifting GenEval from 63 to 92 and OCR accuracy from 59 to 94 while preserving fidelity.

  29. Flow-OPD: On-Policy Distillation for Flow Matching Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Flow-OPD applies on-policy distillation to flow matching models, achieving GenEval of 92 and OCR accuracy of 94 on Stable Diffusion 3.5 Medium while avoiding the seesaw effect of multi-reward optimization.

  30. LoopTrap: Termination Poisoning Attacks on LLM Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.

  31. Evaluation Awareness in Language Models Has Limited Effect on Behaviour

    cs.CL 2026-05 conditional novelty 6.0

    Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.

  32. Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

    cs.LG 2026-05 unverdicted novelty 6.0

    Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.

  33. Affordance Agent Harness: Verification-Gated Skill Orchestration

    cs.RO 2026-05 unverdicted novelty 6.0

    Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...

  34. Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding

    cs.CL 2026-05 unverdicted novelty 6.0

    EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.

  35. Co-Evolving Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...

  36. MAIC-UI: Making Interactive Courseware with Generative UI

    cs.CL 2026-04 unverdicted novelty 6.0

    MAIC-UI provides a zero-code authoring system for generating and iteratively editing interactive courseware from educational materials via structured analysis and incremental generation, with lab and classroom evaluat...

  37. QuantClaw: Precision Where It Matters for OpenClaw

    cs.AI 2026-04 unverdicted novelty 6.0

    QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.

  38. Temporally Extended Mixture-of-Experts Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Temporally extended MoE layers using the option-critic framework with deliberation costs cut switching rates below 5% while retaining most capability on MATH, MMLU, and MMMLU.

  39. AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards

    cs.CV 2026-04 unverdicted novelty 6.0

    AeSlides is a GRPO-based RL framework that uses verifiable aesthetic metrics to optimize LLM slide generation, achieving large gains in layout quality metrics and human scores with only 5K prompts.

  40. ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...

  41. Toward Autonomous Long-Horizon Engineering for ML Research

    cs.CL 2026-04 unverdicted novelty 6.0

    AiScientist improves ML research benchmarks by 10.54 points on PaperBench and reaches 81.82% Any Medal on MLE-Bench Lite through hierarchical control plus durable file-based state instead of conversational handoffs.

  42. Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

    cs.LG 2026-04 unverdicted novelty 6.0

    On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.

  43. ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

    cs.CR 2026-04 unverdicted novelty 6.0

    ClawGuard enforces user-derived access constraints at tool-call boundaries to block indirect prompt injection in tool-augmented LLM agents across web, MCP, and skill injection channels.

  44. ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

    cs.CR 2026-04 unverdicted novelty 6.0

    ClawGuard enforces deterministic, user-derived access constraints at tool boundaries to block indirect prompt injection without changing the underlying LLM.

  45. Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Claw-Eval is a new trajectory-aware benchmark for LLM agents that records execution traces, audit logs, and environment snapshots to evaluate completion, safety, and robustness across 300 tasks, revealing that opaque ...

  46. InCoder-32B-Thinking: Industrial Code World Model for Thinking

    cs.AR 2026-04 unverdicted novelty 6.0

    InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.

  47. An Independent Safety Evaluation of Kimi K2.5

    cs.CR 2026-04 conditional novelty 6.0

    Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.

  48. From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents

    cs.SE 2026-04 unverdicted novelty 6.0

    A two-stage SFT pipeline distills execution-free then execution-based trajectories from a 480B model into smaller Qwen2.5-Coder agents, yielding 62.2% resolution on SWE-bench Verified and 44.1% zero-shot on the multil...

  49. HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention

    cs.LG 2026-03 unverdicted novelty 6.0

    HISA speeds up fine-grained sparse attention indexers via block-then-token hierarchy, delivering substantial speedups at 64K context with no training and quality matching the original DSA on long-context benchmarks.

  50. On-Policy Distillation with Best-of-N Teacher Rollout Selection

    cs.CV 2026-05 unverdicted novelty 5.0

    BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.

  51. On-Policy Distillation with Best-of-N Teacher Rollout Selection

    cs.CV 2026-05 unverdicted novelty 5.0

    BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.

  52. UserGPT Technical Report

    cs.IR 2026-05 unverdicted novelty 5.0

    UserGPT introduces a generative LLM framework with a behavior simulation engine, semantization module, and DF-GRPO post-training that scores 0.7325 on tag prediction and 0.7528 on summary generation on HPR-Bench while...

  53. Learning CLI Agents with Structured Action Credit under Selective Observation

    cs.AI 2026-05 unverdicted novelty 5.0

    CLI agents trained with RL benefit from selective observation via σ-Reveal and structured credit assignment via A³ that leverages AST action sub-chains and trajectory margins.

  54. PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents

    cs.LG 2026-05 unverdicted novelty 5.0

    PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing...

  55. Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving

    cs.DC 2026-05 unverdicted novelty 5.0

    Irminsul recovers up to 83% of prompt tokens above exact-prefix matching and delivers 63% prefill energy savings per cache hit on MLA-MoE models by content-hashing CDC chunks and applying closed-form kr correction.

  56. GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

    cs.CV 2026-04 unverdicted novelty 5.0

    GLM-5V-Turbo integrates multimodal perception as a core part of reasoning and execution for agentic tasks, reporting strong results in visual tool use and multimodal coding while keeping text-only performance competitive.

  57. Reasoning Primitives in Hybrid and Non-Hybrid LLMs

    cs.CL 2026-04 unverdicted novelty 5.0

    Reasoning augmentation extends the difficulty range for both architectures, but hybrid models stay robust longer than transformers as sequential dependence increases in state-based recall tasks.

  58. MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models

    cs.AI 2026-04 unverdicted novelty 5.0

    MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.

  59. SparseBalance: Load-Balanced Long Context Training with Dynamic Sparse Attention

    cs.LG 2026-04 unverdicted novelty 5.0

    SparseBalance dynamically adjusts sparsity and batches workloads to load-balance sparse attention training, delivering up to 1.33x speedup and 0.46% better long-context performance on LongBench.
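The load-balancing half of this summary can be illustrated with a standard greedy longest-processing-time scheduler: estimate a per-sequence cost once the sparsity pattern is chosen, then always hand the next-most-expensive sequence to the least-loaded worker. This is a generic sketch; SparseBalance's actual cost model and batching strategy are not reproduced here.

```python
import heapq

def balance_workloads(costs, n_workers):
    """Greedy longest-processing-time assignment.

    Sort items by estimated cost (e.g. FLOPs of a sequence's sparse
    attention pattern) and always give the next one to the least-loaded
    worker, tracked with a min-heap of (load, worker) pairs.
    """
    heap = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_workers)]
    for idx in sorted(range(len(costs)), key=lambda i: -costs[i]):
        load, w = heapq.heappop(heap)
        assignment[w].append(idx)
        heapq.heappush(heap, (load + costs[idx], w))
    return assignment
```

With variable per-sequence sparsity, naive round-robin can leave some workers idle while one finishes a dense straggler; balancing on estimated cost is the simplest remedy.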

  60. Agentic Insight Generation in VSM Simulations

    cs.CL 2026-04 unverdicted novelty 5.0

    A two-step agentic system for extracting insights from VSM simulations achieves up to 86% accuracy with top LLMs by using progressive data discovery and slim context.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 61 Pith papers · 14 internal anchors

  1. [1]

    System card: Claude Opus 4.5, 2025

    Anthropic. System card: Claude Opus 4.5, 2025

  2. [2]

    Quarot: Outlier-free 4-bit inference in rotated llms

    S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman. Quarot: Outlier-free 4-bit inference in rotated llms, 2024

  3. [3]

    Vending-bench: A benchmark for long-term coherence of autonomous agents

    A. Backlund and L. Petersson. Vending-bench: A benchmark for long-term coherence of autonomous agents. arXiv preprint arXiv:2502.15840, 2025

  4. [4]

    Swe-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents

    I. Badertdinov, A. Golubev, M. Nekrashevich, A. Shevtsov, S. Karasik, A. Andriushchenko, M. Trofimova, D. Litvintseva, and B. Yangel. Swe-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents. arXiv preprint arXiv:2505.20411, 2025

  5. [5]

    Y. Bai, S. Tu, J. Zhang, H. Peng, X. Wang, X. Lv, S. Cao, J. Xu, L. Hou, Y. Dong, J. Tang, and J. Li. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In ACL'25, pages 3639–3664, 2025

  6. [6]

    MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

    C. Bandi, B. Hertzberg, G. Boo, T. Polakam, J. Da, S. Hassaan, M. Sharma, A. Park, E. Hernandez, D. Rambado, et al. Mcp-atlas: A large-scale benchmark for tool-use competency with real mcp servers. arXiv preprint arXiv:2602.00933, 2026

  7. [7]

    $\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

    V. Barres, H. Dong, S. Ray, X. Si, and K. Narasimhan. τ²-bench: Evaluating conversational agents in a dual-control environment. arXiv preprint arXiv:2506.07982, 2025

  8. [8]

    Gemini 3 Pro model card

    Google DeepMind. Gemini 3 Pro model card, 2025

  9. [9]

    DeepSeek-AI, A. Liu, A. Mei, et al. Deepseek-v3.2: Pushing the frontier of open large language models, 2025

  10. [10]

    W. Du, S. Toshniwal, B. Kisacanin, S. Mahdavi, I. Moshkov, G. Armstrong, S. Ge, E. Minasyan, F. Chen, and I. Gitman. Nemotron-math: Efficient long-context distillation of mathematical reasoning from multi-mode supervision. arXiv preprint arXiv:2512.15489, 2025

  11. [11]

    C. Gao, X. Wu, Z. Lin, D. Zhang, and S. Hu. Nextlong: Toward effective long-context training without long documents, 2025

  12. [12]

    H. Ge, J. Feng, Q. Huang, F. Fu, X. Nie, L. Zuo, H. Lin, B. Cui, and X. Liu. Bytescale: Efficient scaling of llm training with a 2048k context length on more than 12,000 gpus. arXiv preprint arXiv:2502.21231, 2025

  13. [13]

    Better & faster large language models via multi-token prediction

    F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve. Better & faster large language models via multi-token prediction. arXiv preprint arXiv:2404.19737, 2024

  14. [14]

    Y. Gu, L. Dong, F. Wei, and M. Huang. Minillm: Knowledge distillation of large language models. In ICLR'23, 2025

  15. [15]

    Y. Gu, Q. Hu, S. Yang, H. Xi, J. Chen, S. Han, and H. Cai. Jet-nemotron: Efficient language model with post neural architecture search. arXiv preprint arXiv:2508.15884, 2025

  16. [16]

    Y. He, S. Li, J. Liu, Y. Tan, W. Wang, H. Huang, X. Bu, H. Guo, C. Hu, B. Zheng, Z. Lin, X. Liu, D. Sun, S. Lin, Z. Zheng, X. Zhu, W. Su, and B. Zheng. Chinese simpleqa: A chinese factuality evaluation for large language models, 2024

  17. [17]

    Ruler: What's the real context size of your long-context language models?

    C.-P. Hsieh, S. Sun, S. Kriman, S. Acharya, D. Rekesh, F. Jia, and B. Ginsburg. Ruler: What's the real context size of your long-context language models? In COLM'24, 2024

  18. [18]

    J. Jia, Z. Chen, X. Wu, C. Gao, Z. Lin, D. Zhang, S. Hu, and B. Guo. Entropylong: Effective long-context training via predictive uncertainty, 2025

  19. [19]

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

  20. [20]

    Fast inference from transformers via speculative decoding

    Y. Leviathan, M. Kalman, and Y. Matias. Fast inference from transformers via speculative decoding. In ICML'23, pages 19274–19286, 2023

  21. [21]

    J. Li, A. Fang, G. Smyrnis, M. Ivgi, et al. Datacomp-lm: In search of the next generation of training sets for language models, 2025

  22. [22]

    J. Li, W. Zhao, J. Zhao, W. Zeng, H. Wu, X. Wang, R. Ge, Y. Cao, Y. Huang, W. Liu, et al. The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon task execution. arXiv preprint arXiv:2510.25726, 2025

  23. [23]

    R. Li, J. Fu, B.-W. Zhang, T. Huang, Z. Sun, C. Lyu, G. Liu, Z. Jin, and G. Li. Taco: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852, 2023

  24. [24]

    A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024

  25. [25]

    A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

  26. [26]

    A. Liu, A. Mei, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, B. Zhang, C. Lin, C. Dong, et al. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025

  27. [27]

    J. Liu, J. Le Tian, V. Daita, Y. Wei, Y. Ding, Y. K. Wang, J. Yang, and L. Zhang. Repoqa: Evaluating long context code understanding. In First Workshop on Long-Context Foundation Models @ ICML 2024

  28. [28]

    On-policy distillation

    K. Lu and T. M. Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025. https://thinkingmachines.ai/blog/on-policy-distillation

  29. [29]

    Towards robust mathematical reasoning

    M.-T. Luong, D. Hwang, H. H. Nguyen, G. Ghiasi, Y. Chervonyi, I. Seo, J. Kim, G. Bingham, J. Lee, S. Mishra, et al. Towards robust mathematical reasoning. In EMNLP'25, pages 35406–35430, 2025

  30. [30]

    Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset

    I. Moshkov, D. Hanley, I. Sorokin, S. Toshniwal, C. Henkel, B. Schifferer, W. Du, and I. Gitman. Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset. arXiv preprint arXiv:2504.16891, 2025

  31. [31]

    Efficient large-scale language model training on gpu clusters using megatron-lm

    D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. A. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phanishayee, and M. Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm, 2021

  32. [32]

    Introducing gpt 5.2, 2025

    OpenAI. Introducing gpt 5.2, 2025

  33. [33]

    Gdpval: Evaluating ai model performance on real-world economically valuable tasks

    T. Patwardhan, R. Dias, E. Proehl, G. Kim, M. Wang, O. Watkins, S. P. Fishman, M. Aljubeh, P. Thacker, L. Fauconnet, et al. Gdpval: Evaluating ai model performance on real-world economically valuable tasks. arXiv preprint arXiv:2510.04374, 2025

  34. [34]

    L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, C. B. C. Zhang, M. Shaaban, J. Ling, S. Shi, et al. Humanity's last exam. arXiv preprint arXiv:2501.14249, 2025

  35. [35]

    Synthetic-2 release: Four million collaboratively generated reasoning traces

    Prime Intellect. Synthetic-2 release: Four million collaboratively generated reasoning traces.

  36. [36]

    Generalizing verifiable instruction following

    V. Pyatkin, S. Malik, V. Graf, H. Ivison, S. Huang, P. Dasigi, N. Lambert, and H. Hajishirzi. Generalizing verifiable instruction following, 2025

  37. [37]

    P. Qi, X. Wan, G. Huang, and M. Lin. Zero bubble pipeline parallelism. arXiv preprint arXiv:2401.10241, 2023

  38. [38]

    Zero: Memory optimizations toward training trillion parameter models

    S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. Zero: Memory optimizations toward training trillion parameter models, 2020

  39. [39]

    D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In CoLM'24, 2024

  40. [40]

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  41. [41]

    Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms

    V. Sirdeshmukh, K. Deshpande, J. Mols, L. Jin, E.-Y. Cardona, D. Lee, J. Kritz, W. Primack, S. Yue, and C. Xing. Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms, 2025

  42. [42]

    H. F. Team. Harbor: A framework for evaluating and optimizing agents and models in container environments, 2026

  43. [43]

    K. Team, T. Bai, Y. Bai, Y. Bao, S. Cai, Y. Cao, Y. Charles, H. Che, C. Chen, G. Chen, et al. Kimi k2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

  44. [44]

    L. Team, A. Shen, B. Li, B. Hu, B. Jing, C. Chen, C. Huang, C. Zhang, C. Yang, C. Lin, et al. Every step evolves: Scaling reinforcement learning for trillion-scale thinking model. arXiv preprint arXiv:2510.18855, 2025

  45. [45]

    T. T.-B. Team. Terminal-bench: A benchmark for ai agents in terminal environments, Apr 2025

  46. [46]

    Y. Tian, C. Wang, Z. Liu, H. Huang, W. Yu, D. Song, J. Tang, and Y. Guo. Beyond literal mapping: Benchmarking and improving non-literal translation evaluation, 2026

  47. [47]

    Y. Wang, S. Wang, S. Zhu, F. Fu, X. Liu, X. Xiao, H. Li, J. Li, F. Wu, and B. Cui. Flexsp: Accelerating large language model training via flexible sequence parallelism. In ASPLOS'25, pages 421–436, 2025

  48. [48]

    Z. Wang, T. Shi, J. He, M. Cai, J. Zhang, and D. Song. Cybergym: Evaluating ai agents' cybersecurity capabilities with real-world vulnerabilities at scale. arXiv preprint arXiv:2506.02548, 2025

  49. [49]

    J. Wei, N. Karina, H. W. Chung, Y. J. Jiao, S. Papay, A. Glaese, J. Schulman, and W. Fedus. Measuring short-form factuality in large language models, 2024

  50. [50]

    J. Wei, Z. Sun, S. Papay, S. McKinney, J. Han, I. Fulford, H. W. Chung, A. T. Passos, W. Fedus, and A. Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516, 2025

  51. [51]

    L.-C. Xiaomi. Mimo-v2-flash technical report, 2026

  52. [52]

    A. Yang, A. Li, B. Yang, B. Zhang, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  53. [53]

    J. Yang, K. Lieret, C. E. Jimenez, A. Wettig, K. Khandpur, Y. Zhang, B. Hui, O. Press, L. Schmidt, and D. Yang. Swe-smith: Scaling data for software engineering agents. arXiv preprint arXiv:2504.21798, 2025

  54. [54]

    S. Yang, J. Kautz, and A. Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. In ICLR'24, 2024

  55. [55]

    S. Yao, N. Shinn, P. Razavi, and K. Narasimhan. tau-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024

  56. [56]

    H. Yen, T. Gao, M. Hou, K. Ding, D. Fleischer, P. Izsak, M. Wasserblat, and D. Chen. Helmet: How to evaluate long-context language models effectively and thoroughly. arXiv preprint arXiv:2410.02694, 2024

  57. [57]

    Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  58. [58]

    T. Yuan, Y. Liu, X. Ye, S. Zhang, J. Tan, B. Chen, C. Song, and D. Zhang. Accelerating the training of large language models using efficient activation rematerialization and optimal hybrid parallelism. In USENIX ATC'24, pages 545–561, 2024

  59. [59]

    Swe-bench goes live!

    L. Zhang, S. He, C. Zhang, Y. Kang, B. Li, C. Xie, J. Wang, M. Wang, Y. Huang, S. Fu, E. Nallipogu, Q. Lin, Y. Dang, S. Rajmohan, and D. Zhang. Swe-bench goes live! arXiv preprint arXiv:2505.23419, 2025

  60. [60]

    C. Zhao, C. Deng, C. Ruan, D. Dai, H. Gao, J. Li, L. Zhang, P. Huang, S. Zhou, S. Ma, et al. Insights into deepseek-v3: Scaling challenges and reflections on hardware for ai architectures. In ISCA'25, pages 1731–1745, 2025

  61. [61]

    X. Zhao, Y. Liu, K. Xu, J. Guo, Z. Wang, Y. Sun, X. Kong, Q. Cao, L. Jiang, Z. Wen, Z. Zhang, and J. Zhou. Small leak can sink a great ship–boost rl training on moe with icepop!, Sep 2025

  62. [62]

    Group Sequence Policy Optimization

    C. Zheng, S. Liu, M. Li, X.-H. Chen, B. Yu, C. Gao, K. Dang, Y. Liu, R. Men, A. Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025

  63. [63]

    Browsecomp-zh: Benchmarking web browsing ability of large language models in Chinese

    P. Zhou, B. Leon, X. Ying, C. Zhang, Y. Shao, Q. Ye, D. Chong, Z. Jin, C. Xie, M. Cao, et al. Browsecomp-zh: Benchmarking web browsing ability of large language models in Chinese. arXiv preprint arXiv:2504.19314, 2025

  64. [64]

    If the agent asks for information NOT in the instruction: - Say you don’t remember or don’t have it - Offer alternative information that IS mentioned in the instruction

  65. [65]

    Sorry, I don’t remember the order ID, can you search for it? My name/email/phone number/zipcode is

    Examples: - If asked for order ID (not in instruction): "Sorry, I don’t remember the order ID, can you search for it? My name/email/phone number/zipcode is ..." - If asked for email (not in instruction): "I don’t have my email handy, but I can give you my name and zip code which are..." - Do not repeat the exact instruction in the conversation. Instead, u...