pith. machine review for the scientific record.

arxiv: 2508.06471 · v1 · submitted 2025-08-08 · 💻 cs.CL

Recognition: 3 theorem links

· Lean Theorem

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

GLM-4.5 Team: Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin
Hao Zeng Jiajie Zhang Kedong Wang Lucen Zhong Mingdao Liu Rui Lu Shulin Cao Xiaohan Zhang Xuancheng Huang Yao Wei Yean Cheng Yifan An Yilin Niu Yuanhao Wen Yushi Bai Zhengxiao Du Zihan Wang Zilin Zhu Bohan Zhang Bosi Wen Bowen Wu Bowen Xu Can Huang Casey Zhao Changpeng Cai Chao Yu Chen Li Chendi Ge Chenghua Huang Chenhui Zhang Chenxi Xu Chenzheng Zhu Chuang Li Congfeng Yin Daoyan Lin Dayong Yang Dazhi Jiang Ding Ai Erle Zhu Fei Wang Gengzheng Pan Guo Wang Hailong Sun Haitao Li Haiyang Li Haiyi Hu Hanyu Zhang Hao Peng Hao Tai Haoke Zhang Haoran Wang Haoyu Yang He Liu He Zhao Hongwei Liu Hongxi Yan Huan Liu Huilong Chen Ji Li Jiajing Zhao Jiamin Ren Jian Jiao Jiani Zhao Jianyang Yan Jiaqi Wang Jiayi Gui Jiayue Zhao Jie Liu Jijie Li Jing Li Jing Lu Jingsen Wang Jingwei Yuan Jingxuan Li Jingzhao Du Jinhua Du Jinxin Liu Junkai Zhi Junli Gao Ke Wang Lekang Yang Liang Xu Lin Fan Lindong Wu Lintao Ding Lu Wang Man Zhang Minghao Li Minghuan Xu Mingming Zhao Mingshu Zhai Pengfan Du Qian Dong Shangde Lei Shangqing Tu Shangtong Yang Shaoyou Lu Shijie Li Shuang Li Shuang-Li Shuxun Yang Sibo Yi Tianshu Yu Wei Tian Weihan Wang Wenbo Yu Weng Lam Tam Wenjie Liang Wentao Liu Xiao Wang Xiaohan Jia Xiaotao Gu Xiaoying Ling Xin Wang Xing Fan Xingru Pan Xinyuan Zhang Xinze Zhang Xiuqing Fu Xunkai Zhang Yabo Xu Yandong Wu Yida Lu Yidong Wang Yilin Zhou Yiming Pan Ying Zhang Yingli Wang Yingru Li Yinpei Su Yipeng Geng Yitong Zhu Yongkun Yang Yuhang Li Yuhao Wu Yujiang Li Yunan Liu Yunqing Wang Yuntao Li Yuxuan Zhang Zezhen Liu Zhen Yang Zhengda Zhou Zhongpei Qiao Zhuoer Feng Zhuorui Liu Zichen Zhang Zijun Yao Zikang Wang Ziqiang Liu Ziwei Chai Zixuan Li Zuodong Zhao Wenguang Chen Jidong Zhai Bin Xu Minlie Huang Hongning Wang Juanzi Li Yuxiao Dong Jie Tang

Pith reviewed 2026-05-11 17:42 UTC · model grok-4.3

classification 💻 cs.CL
keywords: mixture of experts, large language model, agentic tasks, reasoning, coding, open source, benchmark evaluation, reinforcement learning

The pith

GLM-4.5 reaches 70.1 percent on TAU-Bench and 91 percent on AIME 24 using an open-source 355B-parameter MoE model with only 32B parameters active at once.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GLM-4.5 as an open-source Mixture-of-Experts model built to handle agentic tasks, complex reasoning, and coding problems. It describes a multi-stage training process on 23 trillion tokens followed by expert iteration and reinforcement learning that produces a hybrid reasoning capability allowing both extended thinking traces and direct answers. The model posts the listed benchmark scores and ranks near the top of evaluated systems despite activating far fewer parameters than some denser competitors. A smaller 106B-parameter variant is also released to broaden access. The work aims to supply capable tools for building practical AI agents and technical problem solvers.

Core claim

GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With much fewer parameters than several competitors, GLM-4.5 ranks 3rd overall among all evaluated models and 2nd on agentic benchmarks.

What carries the argument

The hybrid reasoning method that supports both thinking and direct response modes, built inside a Mixture-of-Experts architecture with 355 billion total parameters but only 32 billion activated per token.
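As a rough illustration of what "32 billion activated per token" means, the sketch below computes the activated share from the two figures the paper reports and then runs a toy top-k router. The expert count (`NUM_EXPERTS`), `TOP_K`, and the softmax over the selected experts are assumptions chosen for illustration only; the abstract does not describe GLM-4.5's actual routing scheme, and the overall ratio presumably also covers parameters that are active for every token.

```python
import math
import random

TOTAL_PARAMS_B = 355   # total parameters in billions, as reported in the abstract
ACTIVE_PARAMS_B = 32   # activated parameters per token in billions, as reported

NUM_EXPERTS = 16       # hypothetical value, not taken from the paper
TOP_K = 2              # hypothetical value, not taken from the paper


def active_fraction(active_b, total_b):
    """Share of all parameters that take part in processing one token."""
    return active_b / total_b


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and normalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax restricted to the selected experts (numerically stabilized).
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


rng = random.Random(0)  # fixed seed so the toy example is repeatable
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routes = top_k_route(logits, TOP_K)

print(f"activated share: {active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B):.1%}")
print(f"experts run for this token: {len(routes)} of {NUM_EXPERTS}")
```

With the reported figures the activated share comes out at roughly 9% of the total, which is the sense in which the model is cheaper per token than a dense model of the same overall size.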

Load-bearing premise

That the reported benchmark scores reflect genuine capabilities measured through fair, standardized, and uncontaminated evaluations that allow direct comparison to other models.

What would settle it

Independent re-evaluation of the model on the same benchmark problems using fresh, publicly documented prompts and code, or testing on a new suite of problems created after the training cutoff, would confirm or refute the claimed scores.
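The second check described above, testing on problems created after the training cutoff, can be sketched as a simple split of evaluation records by creation date. Everything in this example is hypothetical: the cutoff date (`ASSUMED_CUTOFF`), the record layout, and the toy results are invented for illustration and are not taken from the paper or its evaluation code.

```python
from datetime import date

# Hypothetical cutoff; the abstract does not state the real training cutoff.
ASSUMED_CUTOFF = date(2025, 1, 1)


def split_by_cutoff(records, cutoff):
    """Separate problems created on or before the cutoff from newer ones."""
    older = [r for r in records if r["created"] <= cutoff]
    fresh = [r for r in records if r["created"] > cutoff]
    return older, fresh


def accuracy(records):
    """Fraction of records marked correct, or None for an empty list."""
    if not records:
        return None
    return sum(1 for r in records if r["correct"]) / len(records)


# Toy per-problem results, invented for this example.
records = [
    {"id": "p1", "created": date(2024, 6, 1), "correct": True},
    {"id": "p2", "created": date(2024, 11, 15), "correct": True},
    {"id": "p3", "created": date(2025, 3, 2), "correct": False},
    {"id": "p4", "created": date(2025, 5, 20), "correct": True},
]

older, fresh = split_by_cutoff(records, ASSUMED_CUTOFF)
print("accuracy on pre-cutoff problems: ", accuracy(older))
print("accuracy on post-cutoff problems:", accuracy(fresh))
```

A large gap between the two numbers on a real benchmark would be a reason to look more closely at contamination; a small gap would be consistent with the reported scores, though it would not prove them on its own.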

Original abstract

We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With much fewer parameters than several competitors, GLM-4.5 ranks 3rd overall among all evaluated models and 2nd on agentic benchmarks. We release both GLM-4.5 (355B parameters) and a compact version, GLM-4.5-Air (106B parameters), to advance research in reasoning and agentic AI systems. Code, models, and more information are available at https://github.com/zai-org/GLM-4.5.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters. It features a hybrid reasoning method that supports both thinking and direct response modes. The model undergoes multi-stage training on 23T tokens and post-training with expert model iteration and reinforcement learning. GLM-4.5 reports strong results across agentic, reasoning, and coding (ARC) tasks, including 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. It ranks 3rd overall among evaluated models and 2nd on agentic benchmarks despite having fewer parameters than several competitors. A compact variant, GLM-4.5-Air (106B parameters), is also released, with code and models made available at a GitHub repository.

Significance. If the benchmark results hold under verifiable and standardized conditions, the work advances open-source models for agentic and reasoning tasks by demonstrating competitive performance with an efficient MoE architecture and hybrid reasoning. The public release of both the full and compact models, along with code, is a clear strength that enables reproducibility and community follow-up research on ARC capabilities.

major comments (1)
  1. [Abstract] The central performance claims, including the specific scores of 70.1% on TAU-Bench and 64.2% on SWE-bench Verified together with the 3rd overall and 2nd agentic ranking, are presented without any description of the evaluation methodology. Details on agent scaffolding, tool-use protocols, attempt limits, prompting consistency, use of the hybrid thinking mode, and data-contamination controls are required to establish that the results are comparable to those of competing models; their absence undermines confidence in the headline rankings.
minor comments (2)
  1. [Abstract] The phrase 'expert model iteration' in the abstract is used without definition or reference to a methods section; a brief clarification would improve readability.
  2. The efficiency claim ('much fewer parameters than several competitors') would be strengthened by explicitly listing the parameter counts of the referenced competing models in a comparison table.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments on our manuscript. The feedback highlights an important point about ensuring transparency in the abstract for benchmark results. We address this directly below.

Point-by-point responses
  1. Referee: [Abstract] The central performance claims, including the specific scores of 70.1% on TAU-Bench and 64.2% on SWE-bench Verified together with the 3rd overall and 2nd agentic ranking, are presented without any description of the evaluation methodology. Details on agent scaffolding, tool-use protocols, attempt limits, prompting consistency, use of the hybrid thinking mode, and data-contamination controls are required to establish that the results are comparable to those of competing models; their absence undermines confidence in the headline rankings.

    Authors: We agree that the abstract, constrained by length, omits explicit methodology details, which can affect immediate assessment of comparability. The full manuscript contains sections on evaluation protocols that cover agent scaffolding (standard setups for TAU-Bench and SWE-bench), tool-use protocols, attempt limits, prompting strategies, selective use of the hybrid thinking mode, and data-contamination controls via held-out test sets and decontamination procedures. In the revision, we will expand the abstract with a concise clause summarizing these elements and add cross-references to the detailed methodology sections. This change will improve clarity while preserving the abstract's brevity. We do not believe the core results or rankings require alteration, only better contextualization.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark reporting

full rationale

The paper describes training GLM-4.5 (355B MoE) on 23T tokens with post-training and RL, then reports measured benchmark scores (70.1% TAU-Bench, 91.0% AIME 24, 64.2% SWE-bench Verified). No mathematical derivations, equations, fitted predictions, or first-principles results exist. Claims rest on independent empirical evaluations with no self-definitional loops, fitted-input predictions, or load-bearing self-citations that reduce the central results to inputs by construction. Standard model-release structure; derivation chain is absent.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As an empirical report on a trained foundation model, the central claims rest on standard machine learning assumptions including the validity of benchmark evaluations and the effectiveness of the described training pipeline; no novel axioms, free parameters, or invented entities are introduced beyond typical hyperparameter choices in LLM training.

pith-pipeline@v0.9.0 · 6174 in / 1376 out tokens · 98003 ms · 2026-05-11T17:42:50.551830+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced unclear

    Relation between the paper passage and the cited Recognition theorem.

    GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With much fewer parameters than several competitors, GLM-4.5 ranks 3rd overall among all evaluated models and 2nd on agentic benchmarks.

  • IndisputableMonolith.Foundation.PhiForcing phi_equation unclear

    Relation between the paper passage and the cited Recognition theorem.

    We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes.

  • IndisputableMonolith.Foundation.LedgerForcing conservation_from_balance unclear

    Relation between the paper passage and the cited Recognition theorem.

    Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 54 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

    cs.AR 2026-05 conditional novelty 8.0

    Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

  2. ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

    cs.LG 2026-05 conditional novelty 8.0

    ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...

  3. WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild

    cs.CV 2026-05 unverdicted novelty 8.0

    WildTableBench is the first benchmark for multimodal models on naturally occurring table images, with only one of 21 tested models exceeding 50% accuracy and most ranging from 4.1% to 49.9%.

  4. Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

    cs.AI 2026-04 unverdicted novelty 8.0

    User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

  5. GGBound: A Genome-Grounded Agent for Microbial Life-Boundary Prediction

    cs.CY 2026-05 unverdicted novelty 7.0

    A genome-conditioned 4B LLM agent predicts microbial life boundaries and matches larger frontier models via token fusion, tool use, and a counterfactual gene-grounding reward.

  6. CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

    cs.LG 2026-05 unverdicted novelty 7.0

    CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.

  7. StoryAlign: Evaluating and Training Reward Models for Story Generation

    cs.CL 2026-05 unverdicted novelty 7.0

    StoryReward, trained on a new 100k story preference dataset, sets state-of-the-art performance on the introduced StoryRMB benchmark for aligning LLM stories with human preferences.

  8. When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

    cs.PF 2026-05 unverdicted novelty 7.0

    Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and through...

  9. When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

    cs.PF 2026-05 conditional novelty 7.0

    Hosted open-weight LLMs function as heterogeneous, time-varying services rather than uniform model artifacts, with concentrated demand, decoupled supply and adoption, and measurable gains from task-aware routing.

  10. AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization

    cs.CR 2026-04 unverdicted novelty 7.0

    AgentVisor cuts prompt injection success rate to 0.65% in LLM agents with only 1.45% utility loss via semantic privilege separation and one-shot self-correction.

  11. Dr.Sai: An agentic AI for real-world physics analysis at BESIII

    hep-ex 2026-04 unverdicted novelty 7.0

    Dr.Sai autonomously executed full physics analysis pipelines on real BESIII data to re-measure ten J/psi decay branching fractions, matching established benchmarks without any manual coding.

  12. Towards Temporal Compositional Reasoning in Long-Form Sports Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    SportsTime benchmark and CoTR method improve multimodal AI's temporal compositional reasoning and evidence grounding in long-form sports videos.

  13. FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training

    cs.DC 2026-04 unverdicted novelty 7.0

    FEPLB reduces token and GEMM stragglers in MoE training by 50-70% using nearly free Copy Engine communication on Hopper architecture.

  14. Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 7.0

    Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.

  15. AdversarialCoT: Single-Document Retrieval Poisoning for LLM Reasoning

    cs.IR 2026-04 unverdicted novelty 7.0

    A single query-specific poisoned document, built by extracting and iteratively refining an adversarial chain-of-thought, can substantially degrade reasoning accuracy in retrieval-augmented LLM systems.

  16. E2E-REME: Towards End-to-End Microservices Auto-Remediation via Experience-Simulation Reinforcement Fine-Tuning

    cs.SE 2026-04 unverdicted novelty 7.0

    E2E-REME outperforms nine LLMs in accuracy and efficiency for end-to-end microservice remediation by using experience-simulation reinforcement fine-tuning on a new benchmark called MicroRemed.

  17. ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models

    cs.AI 2026-04 unverdicted novelty 7.0

    ImplicitMemBench shows no LLM exceeds 66% on implicit memory tasks, with top models at 65%, far below humans and pointing to architectural limits beyond scaling.

  18. Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding

    cs.CV 2026-04 unverdicted novelty 7.0

    Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.

  19. Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics

    cs.DC 2026-04 unverdicted novelty 7.0

    Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.

  20. Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study

    cs.SE 2026-05 accept novelty 6.0

    Code language models show no transferable security understanding from code diffs alone, rely on commit messages, miss over 93% of fixes at 0.5% false positive rate, and suffer large drops under group or temporal splits.

  21. ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

    cs.AI 2026-05 unverdicted novelty 6.0

    ComplexMCP benchmark shows current LLM agents achieve at most 60% success on interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.

  22. Edit-Based Refinement for Parallel Masked Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    ME-DLM augments parallel masked diffusion models with edit-distance-supervised refinements to raise quality on coding and math benchmarks while using far fewer diffusion steps.

  23. Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories

    cs.AI 2026-05 unverdicted novelty 6.0

    Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.

  24. VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation

    cs.SE 2026-05 unverdicted novelty 6.0

    VeriContest supplies 946 problems with specs, code, proofs, and tests to benchmark verifiable code generation in Rust/Verus, showing models reach 92% on code but only 5% end-to-end on full verifiable synthesis.

  25. WebTrap: Stealthy Mid-Task Hijacking of Browser Agents During Navigation

    cs.CR 2026-05 unverdicted novelty 6.0

    WebTrap uses multi-step instruction fusion and context-grounded generation to stealthily hijack browser agents mid-navigation while preserving original task success.

  26. HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control

    cs.LG 2026-05 unverdicted novelty 6.0

    HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...

  27. Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment

    cs.CV 2026-05 conditional novelty 6.0

    Degraded image resolution in MLLMs bypasses safety alignments via cognitive overload, raising jailbreak rates across perturbations.

  28. Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

    cs.LG 2026-05 unverdicted novelty 6.0

    Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.

  29. Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and ...

  30. AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion

    cs.CV 2026-05 unverdicted novelty 6.0

    AlbumFill retrieves identity-consistent references from personal albums via VLM-inferred semantic cues to support personalized image completion.

  31. AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving

    cs.AR 2026-04 unverdicted novelty 6.0

    AMMA is a memory-centric multi-chiplet architecture using HBM-PNM cubes, custom logic dies, hybrid parallelism, and reordered collectives that delivers 15.5X lower attention latency and 6.9X lower energy than NVIDIA H...

  32. MAIC-UI: Making Interactive Courseware with Generative UI

    cs.CL 2026-04 unverdicted novelty 6.0

    MAIC-UI provides a zero-code authoring system for generating and iteratively editing interactive courseware from educational materials via structured analysis and incremental generation, with lab and classroom evaluat...

  33. QuantClaw: Precision Where It Matters for OpenClaw

    cs.AI 2026-04 unverdicted novelty 6.0

    QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.

  34. In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores

    cs.CL 2026-04 unverdicted novelty 6.0

    Standardized-test benchmarks for LLM fairness are unreliable because prompt wording alone drives most score variance and ranking changes, while a multi-agent conversational framework reveals consistent model-specific ...

  35. DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data

    cs.LG 2026-04 unverdicted novelty 6.0

    A 4B deep research agent trained on 10K open data outperforms prior agents under 9B parameters and narrows the gap to 30B-class systems on research benchmarks.

  36. AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards

    cs.CV 2026-04 unverdicted novelty 6.0

    AeSlides is a GRPO-based RL framework that uses verifiable aesthetic metrics to optimize LLM slide generation, achieving large gains in layout quality metrics and human scores with only 5K prompts.

  37. Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    Expert upcycling expands MoE models by duplicating experts and continuing pre-training, matching baseline performance while saving 32% GPU hours in 7B-13B experiments.

  38. ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving

    cs.LG 2026-04 unverdicted novelty 6.0

    ELMoE-3D achieves 6.6x average speedup and 4.4x energy efficiency gain for MoE serving on 3D hardware by scaling expert and bit elasticity for elastic self-speculative decoding.

  39. Towards Knowledgeable Deep Research: Framework and Benchmark

    cs.AI 2026-04 unverdicted novelty 6.0

    The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.

  40. ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment

    cs.IR 2026-04 unverdicted novelty 6.0

    ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.

  41. Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR

    cs.LG 2026-04 unverdicted novelty 6.0

    Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.

  42. Learning to Retrieve from Agent Trajectories

    cs.IR 2026-03 conditional novelty 6.0

    Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.

  43. OptiMat Alloys: a FAIR, living database of multi-principal element alloys enabled by a conversational agent

    cond-mat.mtrl-sci 2026-04 unverdicted novelty 5.0

    OptiMat Alloys is a conversational AI system that maintains a living FAIR database of multi-principal element alloy calculations and enables natural-language, on-demand computations with built-in uncertainty checks.

  44. OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 5.0

    OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.

  45. Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

    cs.AI 2026-04 unverdicted novelty 5.0

    Mixed-complexity procedural datasets provide up to 5x sample efficiency for RLVR on small models in low-data regimes, with low-to-high complexity generalization observed across counting, graph, and spatial tasks.

  46. LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.

  47. Apriel-1.5-OpenReasoner: RL Post-Training for General-Purpose and Efficient Reasoning

    cs.LG 2026-04 unverdicted novelty 5.0

    Apriel-1.5-OpenReasoner uses RL post-training with adaptive sampling and difficulty-aware penalties to boost reasoning accuracy on AIME, GPQA, MMLU-Pro and LiveCodeBench while producing shorter traces and generalizing...

  48. Yet Even Less Is Even Better For Agentic, Reasoning, and Coding LLMs

    cs.SE 2026-04 unverdicted novelty 5.0

    STITCH trains superior agentic coding and reasoning LLMs by using fewer high-quality trajectories filtered to keep only critical decision tokens, delivering up to 63% relative gains on SWE-bench Verified.

  49. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    cs.CL 2025-12 unverdicted novelty 5.0

    DeepSeek-V3.2 adds sparse attention, scaled RL post-training, and large-scale agentic data synthesis to reach GPT-5-level performance and gold medals in 2025 IMO and IOI with its high-compute variant.

  50. UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    cs.AI 2025-09 conditional novelty 5.0

    UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.

  51. Can Muon Fine-tune Adam-Pretrained Models?

    cs.LG 2026-05 unverdicted novelty 4.0

    Constraining fine-tuning updates with LoRA mitigates performance degradation when switching from Adam to Muon on pretrained models.

  52. Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

    cs.AI 2026-05 unverdicted novelty 4.0

    Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.

  53. Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference

    cs.PF 2026-05 unverdicted novelty 4.0

    Nvidia achieves 1.6x throughput with NVFP4 but hits a VRAM wall for 70B+ models, while Apple UMA enables linear scaling to 80B at 4-bit with up to 23x better energy efficiency.

  54. Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

    cs.AI 2026-05 unverdicted novelty 3.0

    Safactory combines parallel simulation, trustworthy data management, and asynchronous evolution platforms into a single pipeline claimed to be the first unified framework for trustworthy autonomous agents.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 51 Pith papers · 18 internal anchors

  1. [1]

    A. Abbas, K. Tirumala, D. Simig, S. Ganguli, and A. S. Morcos. Semdedup: Data-efficient learning at web-scale through semantic deduplication. arXiv preprint arXiv:2303.09540, 2023

  2. [2]

    C. An, Z. Xie, X. Li, L. Li, J. Zhang, S. Gong, M. Zhong, J. Xu, X. Qiu, M. Wang, and L. Kong. Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025

  3. [3]

    Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3119–3137, 2024

  4. [4]

    Y. Bai, S. Tu, J. Zhang, H. Peng, X. Wang, X. Lv, S. Cao, J. Xu, L. Hou, Y. Dong, J. Tang, and J. Li. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3639–3664, Vienna, Austria, July 202...

  5. [5]

    M. Bavarian, H. Jun, N. Tezak, J. Schulman, C. McLeavey, J. Tworek, and M. Chen. Efficient training of language models to fill in the middle, 2022

  6. [6]

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  7. [7]

    A. Chen, A. Li, B. Gong, B. Jiang, B. Fei, B. Yang, B. Shan, C. Yu, C. Wang, C. Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025

  8. [8]

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  9. [9]

    S. Cheng, Y. Bao, Q. Cao, L. Huang, L. Kang, Z. Liu, Y. Lu, W. Zhu, Z. Huang, T. Li, et al. Seed-x: Building strong multilingual translation llm with 7b parameters. arXiv preprint arXiv:2507.13618, 2025

  10. [10]

    K. Deshpande, V. Sirdeshmukh, J. B. Mols, L. Jin, E.-Y. Hernandez-Cardona, D. Lee, J. Kritz, W. E. Primack, S. Yue, and C. Xing. Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms. In Findings of the Association for Computational Linguistics: ACL 2025, pages 18632–18702, 2025

  11. [11]

    H. Ding, Z. Wang, G. Paolini, V. Kumar, A. Deoras, D. Roth, and S. Soatto. Fewer truncations improve language modeling. In Proceedings of the 41st International Conference on Machine Learning, pages 11030–11048, 2024

  12. [12]

    Better & faster large language models via multi-token prediction

    F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve. Better & faster large language models via multi-token prediction. arXiv preprint arXiv:2404.19737, 2024

  13. [13]

    D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  14. [14]

    D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)

  15. [15]

    A. Henry, P. R. Dachapally, S. Pawar, and Y. Chen. Query-key normalization for transformers, 2020

  16. [16]

    C.-P. Hsieh, S. Sun, S. Kriman, S. Acharya, D. Rekesh, F. Jia, and B. Ginsburg. Ruler: What’s the real context size of your long-context language models? In First Conference on Language Modeling

  17. [17]

    S. Hu, Y. Tu, X. Han, G. Cui, C. He, W. Zhao, X. Long, Z. Zheng, Y. Fang, Y. Huang, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. In First Conference on Language Modeling

  18. [18]

    OpenAI o1 System Card

    A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  19. [19]

    N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations

  20. [20]

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

  21. [21]

    K. Jordan, Y. Jin, V. Boza, Y. Jiacheng, F. Cecista, L. Newhouse, and J. Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon, 6

  22. [22]

    A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431. Association for Computational Linguistics, April 2017

  23. [23]

    A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

  24. [24]

    J. Liu, J. Su, X. Yao, Z. Jiang, G. Lai, Y. Du, Y. Qin, W. Xu, E. Lu, J. Yan, et al. Muon is scalable for llm training. arXiv preprint arXiv:2502.16982, 2025

  25. [25]

    M. Luo, S. Tan, J. Wong, X. Shi, W. Y. Tang, M. Roongta, C. Cai, J. Luo, L. E. Li, R. A. Popa, and I. Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2, 2025. Notion Blog

  26. [26]

    S. G. Patil, H. Mao, C. Cheng-Jie Ji, F. Yan, V. Suresh, I. Stoica, and J. E. Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025

  27. [27]

    G. Penedo, H. Kydlíček, V. Sabolčec, B. Messmer, N. Foroutan, A. H. Kargaran, C. Raffel, M. Jaggi, L. Von Werra, and T. Wolf. Fineweb2: One pipeline to scale them all–adapting pre-training data processing to every language. arXiv preprint arXiv:2506.20920, 2025

  28. [28]

    L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, C. B. C. Zhang, M. Shaaban, J. Ling, S. Shi, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249, 2025

  29. [29]

    Y. Qin, T. Zhang, Y. Shen, W. Luo, Y. Zhang, Y. Qiao, Z. Zhou, W. Zhang, B. CUI, et al. Sysbench: Can llms follow system message? In The Thirteenth International Conference on Learning Representations, 2024

  30. [30]

    D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024

  31. [31]

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  32. [32]

    D. Su, K. Kong, Y. Lin, J. Jennings, B. Norick, M. Kliegl, M. Patwary, M. Shoeybi, and B. Catanzaro. Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset. arXiv preprint arXiv:2412.02595, 2024

  33. [33]

    G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  34. [34]

    K. Team, Y. Bai, Y. Bao, G. Chen, J. Chen, N. Chen, R. Chen, Y. Chen, Y. Chen, Y. Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025

  35. [35]

    T. T.-B. Team. Terminal-bench: A benchmark for ai agents in terminal environments, Apr 2025

  36. [36]

    M. Tian, L. Gao, S. Zhang, X. Chen, C. Fan, X. Guo, R. Haas, P. Ji, K. Krongchon, Y. Li, et al. Scicode: A research coding benchmark curated by scientists. Advances in Neural Information Processing Systems, 37:30624–30650, 2024

  37. [37]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  38. [38]

    K. Vodrahalli, S. Ontanon, N. Tripuraneni, K. Xu, S. Jain, R. Shivanna, J. Hui, N. Dikkala, M. Kazemi, B. Fatemi, R. Anil, E. Dyer, S. Shakeri, R. Vij, H. Mehta, V. Ramasesh, Q. Le, E. Chi, Y. Lu, O. Firat, A. Lazaridou, J.-B. Lespiau, N. Attaluri, and K. Olszewska. Michelangelo: Long context evaluations beyond haystacks via latent structure queries, 2024

  39. [39]

    F. Wan, W. Shen, S. Liao, Y. Shi, C. Li, Z. Yang, J. Zhang, F. Huang, J. Zhou, and M. Yan. Qwenlong-l1: Towards long-context large reasoning models with reinforcement learning. arXiv preprint arXiv:2505.17667, 2025

  40. [40]

    L. Wang, H. Gao, C. Zhao, X. Sun, and D. Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts. arXiv preprint arXiv:2408.15664, 2024

  41. [41]

    S. Wang, L. Yu, C. Gao, C. Zheng, S. Liu, R. Lu, K. Dang, X. Chen, J. Yang, Z. Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. arXiv preprint arXiv:2506.01939, 2025

  42. [42]

    X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y. Shao, N. Muennighoff, Y. Zhang, B. Hui, J. Lin, R. Brennan, H. Peng, H. Ji, and G. Neubig. Openhands: An open platform for AI software developers as generalist agents. In The Thirteenth International Conference on Lea...

  43. [43]

    Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37:95266–95290, 2024

  44. [44]

    J. Wei, N. Karina, H. W. Chung, Y. J. Jiao, S. Papay, A. Glaese, J. Schulman, and W. Fedus. Measuring short-form factuality in large language models, 2024

  45. [45]

    J. Wei, Z. Sun, S. Papay, S. McKinney, J. Han, I. Fulford, H. W. Chung, A. T. Passos, W. Fedus, and A. Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516, 2025

  46. [46]

    Z. Xi, Y. Ding, W. Chen, B. Hong, H. Guo, J. Wang, D. Yang, C. Liao, X. Guo, W. He, et al. Agentgym: Evolving large language model-based agents across diverse environments. arXiv preprint arXiv:2406.04151, 2024

  47. [47]

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  48. [48]

    S. Yao, N. Shinn, P. Razavi, and K. Narasimhan. tau-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024

  49. [49]

    Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  50. [50]

    A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, et al. Glm-130b: An open bilingual pre-trained model. In The Eleventh International Conference on Learning Representations

  51. [51]

    Safetybench: Evaluating the safety of large language models with multiple choice questions

    Z. Zhang, L. Lei, L. Wu, R. Sun, Y. Huang, C. Long, X. Liu, X. Lei, J. Tang, and M. Huang. Safetybench: Evaluating the safety of large language models with multiple choice questions. arXiv preprint arXiv:2309.07045, 2023

  52. [52]

    J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023