super hub Mixed citations

GLM-5: from Vibe Coding to Agentic Engineering

Bin Chen, GLM-5-Team: Aohan Zeng, Qinkai Zheng, Xin Lv, Zhengxiao Du, Zhenyu Hou · 2026 · cs.LG · arXiv 2602.15763

Mixed citation behavior. Most common role is background (68%).

139 Pith papers citing it

Background 68% of classified citations

open full Pith review browse 139 citing papers more from Bin Chen arXiv PDF

abstract

We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that further improve RL quality, enabling the model to learn from complex, long-horizon interactions more effectively. Through these innovations, GLM-5 achieves state-of-the-art performance on major open benchmarks. Most critically, GLM-5 demonstrates unprecedented capability in real-world coding tasks, surpassing previous baselines in handling end-to-end software engineering challenges. Code, models, and more information are available at https://github.com/zai-org/GLM-5.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 22 baseline 6 dataset 2 method 2 other 2

citation-polarity summary

background 23 baseline 6 unclear 2 use method 2 use dataset 1

claims ledger

abstract We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that fur

authors

Bin Chen GLM-5-Team: Aohan Zeng Qinkai Zheng Xin Lv Zhengxiao Du Zhenyu Hou

co-cited works

representative citing papers

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

cs.CL · 2026-06-15 · unverdicted · novelty 8.0 · 2 refs

MetaSyn benchmark shows LLM pipelines recover at most 52.7% of ground-truth included studies due to screening failures on PI/ECO eligibility, despite 90.9% retrieval recall at K=200.

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

cs.AI · 2026-06-03 · unverdicted · novelty 8.0

AutoLab benchmark shows frontier models mostly fail at sustained iterative optimization due to premature termination, with persistence as the key success factor.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

cs.CL · 2026-05-11 · unverdicted · novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

cs.AI · 2026-05-11 · unverdicted · novelty 8.0

Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

cs.CL · 2026-04-13 · unverdicted · novelty 8.0

OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.

SmoothAgent: Efficient Long-Horizon LLM-Based Agent Serving with Lookahead Context Engineering

cs.DC · 2026-06-30 · unverdicted · novelty 7.0

SmoothAgent introduces lookahead context engineering to eliminate transformation overhead in LLM agents, reducing TTFT by up to 11.9x through proactive KV cache preparation.

No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

Introduces VidPair-Halluc benchmark of 1K background-controlled adversarial video pairs and 11K QA pairs generated via PairFlow pipeline to evaluate hallucination in LVMs.

SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows

cs.SE · 2026-06-29 · unverdicted · novelty 7.0

SpreadsheetBench 2 provides 321 expert-validated tasks from authentic business data showing frontier LLMs reach only 34.89% overall accuracy on end-to-end spreadsheet workflows.

Dockerless: Environment-Free Program Verifier for Coding Agents

cs.SE · 2026-06-26 · unverdicted · novelty 7.0

Dockerless uses agentic repository exploration to verify patches without execution, enabling SFT and RL training of coding agents that reach 62.0/50.0/35.2% resolve rates on SWE-bench Verified/Multilingual/Pro while matching environment-based results.

Self-Harness: Harnesses That Improve Themselves

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

Self-Harness lets LLM agents autonomously refine their interaction harnesses through weakness mining, proposal generation, and validation, raising held-out pass rates on Terminal-Bench-2.0 from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1% across three models.

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

SkeMex distills agent trajectories into value-aware skills organized in general/task/action branches and evolves them via a closed-loop Read-Write-Assess-Govern process, outperforming prior memory agents on clinical tasks.

OPRD: On-Policy Representation Distillation

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

OPRD performs distillation in hidden-state space on on-policy data for deterministic gradients and better math benchmark performance, plus OPRD-Bridge for cross-architecture transfer via low-rank projectors.

Reinforcement Learning from Rich Feedback with Distributional DAgger

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

DistIL applies distributional DAgger with forward cross-entropy to achieve monotonic policy improvement and better Pass@N from rich feedback in RL for reasoning tasks.

Knowledge Index of Noah's Ark

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

Introduces KINA benchmark with 899 items over 261 disciplines, formal (1-1/e) coverage guarantee and bonus-on-bar tournament theorem, plus evaluations of 42 models with top score 53.17%.

D^2SD: Accelerating Speculative Decoding with Dual Diffusion Draft Models

cs.DC · 2026-06-03 · unverdicted · novelty 7.0

D^2SD uses two diffusion drafters in a prefix tree structure with confidence scores to select and recover alternative draft sequences, achieving higher acceptance rates in speculative decoding.

Spectral Scaling Laws of Muon

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Muon momentum matrices show layer-dependent power-law scaling of stabilized singular value quantiles with model size from 77M to 2.8B parameters.

EntSQL: A Benchmark for Grounding Text-to-SQL in Long-Context Enterprise Knowledge

cs.CL · 2026-06-02 · unverdicted · novelty 7.0

EntSQL is a new benchmark with 1,066 examples across five domains where top systems reach only 15.9% accuracy on English inputs when long-form enterprise documents are provided.

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

cs.CV · 2026-06-01 · conditional · novelty 7.0

Moment-Video benchmark shows top video MLLM achieves only 39.6% accuracy on momentary visual event tasks, with most open-source models below 25%.

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

cs.AI · 2026-06-01 · conditional · novelty 7.0

AutoMedBench evaluates AI agents on long-horizon medical workflows across five stages and finds validation and submission as dominant failure points based on thousands of runs.

CRAB-Bench: Evaluating LLM Agents under Complex Task Dependencies and Human-aligned User Simulation

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

CRAB-Bench and RUSE create a new evaluation framework for LLM agents on constraint-graph tasks with realistic human-like user behaviors, reporting 61% pass@1 for the best model and up to 57% further drops under RUSE.

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

Introduces LongJudgeBench benchmark showing LLM judges remain unstable for long-form output evaluation even with rubrics or references.

Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

cs.DC · 2026-05-31 · unverdicted · novelty 7.0

On a real multi-node H100 cluster the authors show that for MLA, routing the ~1 KB compressed query row is cheaper than moving cache chunks and supply a topology-aware cost model accurate to ~7% on IBGDA fabrics.

citing papers explorer

Showing 6 of 6 citing papers after filters.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation cs.CL · 2026-05-11 · unverdicted · none · ref 52 · internal anchor
A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents cs.AI · 2026-04-03 · unverdicted · none · ref 31 · internal anchor
AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.
Affordance Agent Harness: Verification-Gated Skill Orchestration cs.RO · 2026-05-01 · unverdicted · none · ref 82 · 2 links · internal anchor
Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tradeoffs in open-world affordance grounding.
Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents cs.AI · 2026-04-07 · unverdicted · none · ref 53 · internal anchor
Claw-Eval is a new trajectory-aware benchmark for LLM agents that records execution traces, audit logs, and environment snapshots to evaluate completion, safety, and robustness across 300 tasks, revealing that opaque grading misses 44% of safety issues.
InCoder-32B-Thinking: Industrial Code World Model for Thinking cs.AR · 2026-04-03 · unverdicted · none · ref 47 · internal anchor
InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.
Learning CLI Agents with Structured Action Credit under Selective Observation cs.AI · 2026-05-08 · unverdicted · none · ref 11 · internal anchor
CLI agents trained with RL benefit from selective observation via σ-Reveal and structured credit assignment via A³ that leverages AST action sub-chains and trajectory margins.

GLM-5: from Vibe Coding to Agentic Engineering

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer