GLM-5: from Vibe Coding to Agentic Engineering

Bin Chen, GLM-5-Team: Aohan Zeng, Qinkai Zheng, Xin Lv, Zhengxiao Du, Zhenyu Hou · 2026 · cs.LG · arXiv 2602.15763

Mixed citation behavior. Most common role is background (69%).

167 Pith papers citing it

abstract

We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that further improve RL quality, enabling the model to learn from complex, long-horizon interactions more effectively. Through these innovations, GLM-5 achieves state-of-the-art performance on major open benchmarks. Most critically, GLM-5 demonstrates unprecedented capability in real-world coding tasks, surpassing previous baselines in handling end-to-end software engineering challenges. Code, models, and more information are available at https://github.com/zai-org/GLM-5.

citation-role summary

background 24 · baseline 6 · dataset 2 · method 2 · other 2

citation-polarity summary

background 25 · baseline 6 · unclear 2 · use method 2 · use dataset 1

representative citing papers

StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation

cs.LG · 2026-06-18 · unverdicted · novelty 8.0

StreamKL is the first fused GPU primitive for attention KL divergence that reduces memory from O(N_Q N_K) to O(1) via an online one-pass formulation and tile-wise recomputation.

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

cs.CL · 2026-06-15 · unverdicted · novelty 8.0 · 3 refs

MetaSyn benchmark shows LLM pipelines recover at most 52.7% of ground-truth included studies due to screening failures on PI/ECO eligibility, despite 90.9% retrieval recall at K=200.

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

cs.CL · 2026-06-11 · unverdicted · novelty 8.0

LoHoSearch is a new benchmark of 544 KG-constructed questions across 11 domains where the strongest search agent scores 34.74% and context strategies add at most 6.8%.

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

cs.AI · 2026-06-03 · unverdicted · novelty 8.0

AutoLab benchmark shows frontier models mostly fail at sustained iterative optimization due to premature termination, with persistence as the key success factor.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

cs.CL · 2026-05-11 · unverdicted · novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

cs.AI · 2026-05-11 · unverdicted · novelty 8.0

Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

cs.CL · 2026-04-13 · unverdicted · novelty 8.0

OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.

SmoothAgent: Efficient Long-Horizon LLM-Based Agent Serving with Lookahead Context Engineering

cs.DC · 2026-06-30 · unverdicted · novelty 7.0

SmoothAgent introduces lookahead context engineering to eliminate transformation overhead in LLM agents, reducing TTFT by up to 11.9x through proactive KV cache preparation.

No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

Introduces VidPair-Halluc benchmark of 1K background-controlled adversarial video pairs and 11K QA pairs generated via PairFlow pipeline to evaluate hallucination in LVMs.

SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows

cs.SE · 2026-06-29 · unverdicted · novelty 7.0

SpreadsheetBench 2 provides 321 expert-validated tasks from authentic business data showing frontier LLMs reach only 34.89% overall accuracy on end-to-end spreadsheet workflows.

Dockerless: Environment-Free Program Verifier for Coding Agents

cs.SE · 2026-06-26 · unverdicted · novelty 7.0

Dockerless uses agentic repository exploration to verify patches without execution, enabling SFT and RL training of coding agents that reach 62.0/50.0/35.2% resolve rates on SWE-bench Verified/Multilingual/Pro while matching environment-based results.

Toward Agentic SysAdmin: Rethinking System Administration with AI Agents

cs.NI · 2026-06-25 · unverdicted · novelty 7.0

NetLLMeval is an emulation-based framework for benchmarking LLM solvers on network admin tasks, with a 24000-run study showing solver architecture lifts a 14B model from 0.43 to 0.88 accuracy and allows local models to match frontier systems.

HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs

cs.AI · 2026-06-22 · unverdicted · novelty 7.0

HOLMES is the first real-world benchmark for higher-order symbolic reasoning in LLMs, where models average 50.64% accuracy and the best reaches 59.54%.

CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents

cs.AI · 2026-06-22 · unverdicted · novelty 7.0

CLI-Universe synthesizes a verified 6K dataset of terminal-agent tasks that, when used to fine-tune Qwen3-32B, reaches 33.4% on Terminal-Bench 2.0 and sets a new open-source SOTA for models at or below 32B parameters.

MacAgentBench: Benchmarking AI Agents on Real-World macOS Desktop

cs.AI · 2026-06-21 · unverdicted · novelty 7.0

MacAgentBench is a new benchmark for macOS AI agents with 676 tasks, deterministic multi-checkpoint evaluation, and tests across frameworks showing skill libraries drive performance more than framework design.

AOR-Bench: Do Large Audio Language Models Over-Refuse Pseudo-Harmful Queries?

cs.SD · 2026-06-19 · unverdicted · novelty 7.0

Introduces the first benchmark for over-refusal in large audio language models using 3,000 pseudo-harmful audio samples and evaluates 12 models across six families, finding widespread over-refusal.

Agentic Time Machine as an Infrastructure for Future-Event Forecasting

cs.AI · 2026-06-19 · unverdicted · novelty 7.0

Agentic Time Machine reconstructs historical web states for offline evaluation of forecasting agents, with a multi-agent framework achieving top ranks on FutureX live and past benchmarks.

StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

cs.SE · 2026-06-17 · unverdicted · novelty 7.0

StaminaBench evaluates coding agents over 100 procedurally generated change requests to a REST API, finding that tested models fail within 5-6 turns without feedback but improve up to 12x with test feedback and good harnesses.

PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation

cs.LG · 2026-06-15 · conditional · novelty 7.0

PowerOPD applies the Box-Cox power transformation to create natively bounded, sign-consistent rewards for on-policy distillation, delivering up to +6.37 Avg@8 gains over vanilla OPD on math reasoning benchmarks while cutting compute costs.

FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

FORT synthesizes shortcut-resistant search tasks by controlling four identified shortcut risks across entity selection, graph construction, question formulation, and refinement, producing training data that yields agents with longer search trajectories and top performance among open-source models on

AgentCanary: A Security Evaluation Framework for Autonomous AI Agents in Real Executable Environments

cs.CR · 2026-06-09 · unverdicted · novelty 7.0

AgentCanary introduces an Entry × Impact risk taxonomy, high-fidelity real tool environments with persistent state, and multi-dimensional trajectory evaluation to assess AI agent security across models and attacks.

Beyond Absolute Imitation: Anchored Residual Guidance for Privileged On-Policy Distillation

cs.LG · 2026-06-09 · unverdicted · novelty 7.0

AR-OPD disentangles privileged supervision via anchored residual guidance to reduce hindsight leakage in on-policy distillation, reporting gains of 2.3 points over full privileged OPD and 7.9 over SFT on reasoning tasks.

citing papers explorer

Showing 40 of 40 citing papers after filters.

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

cs.AI · 2026-06-03 · unverdicted · none · ref 19 · internal anchor

AutoLab benchmark shows frontier models mostly fail at sustained iterative optimization due to premature termination, with persistence as the key success factor.

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

cs.AI · 2026-05-11 · unverdicted · none · ref 72 · internal anchor

Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs

cs.AI · 2026-06-22 · unverdicted · none · ref 18 · internal anchor

HOLMES is the first real-world benchmark for higher-order symbolic reasoning in LLMs, where models average 50.64% accuracy and the best reaches 59.54%.

CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents

cs.AI · 2026-06-22 · unverdicted · none · ref 15 · internal anchor

CLI-Universe synthesizes a verified 6K dataset of terminal-agent tasks that, when used to fine-tune Qwen3-32B, reaches 33.4% on Terminal-Bench 2.0 and sets a new open-source SOTA for models at or below 32B parameters.

MacAgentBench: Benchmarking AI Agents on Real-World macOS Desktop

cs.AI · 2026-06-21 · unverdicted · none · ref 2 · internal anchor

MacAgentBench is a new benchmark for macOS AI agents with 676 tasks, deterministic multi-checkpoint evaluation, and tests across frameworks showing skill libraries drive performance more than framework design.

Agentic Time Machine as an Infrastructure for Future-Event Forecasting

cs.AI · 2026-06-19 · unverdicted · none · ref 6 · internal anchor

Agentic Time Machine reconstructs historical web states for offline evaluation of forecasting agents, with a multi-agent framework achieving top ranks on FutureX live and past benchmarks.

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

cs.AI · 2026-06-08 · unverdicted · none · ref 85 · internal anchor

SkeMex distills agent trajectories into value-aware skills organized in general/task/action branches and evolves them via a closed-loop Read-Write-Assess-Govern process, outperforming prior memory agents on clinical tasks.

Knowledge Index of Noah's Ark

cs.AI · 2026-06-03 · unverdicted · none · ref 37 · internal anchor

Introduces KINA benchmark with 899 items over 261 disciplines, formal (1-1/e) coverage guarantee and bonus-on-bar tournament theorem, plus evaluations of 42 models with top score 53.17%.

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

cs.AI · 2026-06-01 · conditional · none · ref 22 · internal anchor

AutoMedBench evaluates AI agents on long-horizon medical workflows across five stages and finds validation and submission as dominant failure points based on thousands of runs.

PassNet: Scaling Large Language Models for Graph Compiler Pass Generation

cs.AI · 2026-05-28 · unverdicted · none · ref 16 · internal anchor

PassNet provides a dataset of 18K graphs and PassBench for LLM-generated compiler passes, with fine-tuned models achieving 2.67x gains on long-tail tasks where TorchInductor underperforms.

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

cs.AI · 2026-05-27 · unverdicted · none · ref 31 · internal anchor

LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.

LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

cs.AI · 2026-05-26 · unverdicted · none · ref 23 · internal anchor

LiveK12Bench is a growing multi-disciplinary benchmark showing LMMs like GPT-5 drop from 79 to 53 under realistic exam constraints including process rigor and efficiency.

WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

cs.AI · 2026-05-17 · unverdicted · none · ref 47 · 2 links · internal anchor

WebGameBench is a new benchmark that evaluates coding agents on building browser-native games from frozen specifications, with runtime browser evaluation showing best agents reach 76.9% usable rate but only 20.2% excellent rate.

AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

cs.AI · 2026-04-03 · unverdicted · none · ref 31 · internal anchor

AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.

PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments

cs.AI · 2026-03-24 · unverdicted · none · ref 13 · internal anchor

PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.

Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification

cs.AI · 2026-07-02 · conditional · none · ref 45 · internal anchor

Vera automates safety testing for LLM agents via literature-driven risk taxonomies, combinatorial case generation, and evidence-grounded verification in isolated environments, showing 93.9% average attack success on four frameworks.

Hawk: Harnessing Hardware-Aware Knowledge for High-Performance NPU Kernel Generation

cs.AI · 2026-07-02 · unverdicted · none · ref 3 · internal anchor

Hawk is a training-free framework that boosts NPU kernel generation accuracy to 80% and achieves up to 2.2x speedup via hardware-aware knowledge synthesis, 2D retrieval, and effect-driven distillation.

Which Tokens Matter? Adaptive Token Selection for RLVR with the Relative Surprisal Index

cs.AI · 2026-06-30 · unverdicted · none · ref 32 · internal anchor

Introduces RSI metric and RSI-S filtering method for adaptive token selection in RLVR, reporting 2-3 point gains over GRPO on AIME/AMC benchmarks.

Cognitive World Models for Process-Level Social Influence Evaluation

cs.AI · 2026-06-28 · unverdicted · none · ref 31 · internal anchor

CogWM is a new LLM user model for evaluating social influence by predicting and tracking cognitive state evolution in dialogues, trained on 150k samples and shown to differentiate AI agents effectively.

Agent-as-a-Router: Agentic Model Routing for Coding Tasks

cs.AI · 2026-06-22 · unverdicted · none · ref 43 · 2 links · internal anchor

Agent-as-a-Router turns static LLM routing into an iterative C-A-F loop that accumulates execution feedback to lower cumulative regret on coding tasks.

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

cs.AI · 2026-06-18 · unverdicted · none · ref 3 · internal anchor

LedgerAgent is an inference-time method that uses a structured ledger to track task states and enforce domain policies in tool-calling agents, improving average pass^k over standard prompt-based approaches across four domains.

HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry

cs.AI · 2026-06-12 · unverdicted · none · ref 9 · internal anchor

HarnessX assembles and evolves agent harnesses via substitution algebra and AEGIS trace analysis, reporting +14.5% average gains (up to +44%) on five benchmarks.

What Makes Interaction Trajectories Effective for Training Terminal Agents?

cs.AI · 2026-06-02 · unverdicted · none · ref 2 · internal anchor

Trajectories from weaker agents outperform stronger ones for training terminal agents due to environment-grounded supervision that exposes inspect-act-verify behaviors.

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

cs.AI · 2026-06-01 · unverdicted · none · ref 5 · internal anchor

SafeSteer restricts reverse KL penalty to safety tokens selected via activation steering, achieving strong safety on seven benchmarks with minimal degradation on five capability benchmarks using only 100 harmful samples and no general data.

Reasoning4Sciences: Bridging Reasoning Language Models to All Scientific Branches

cs.AI · 2026-05-31 · unverdicted · none · ref 93 · 2 links · internal anchor

A survey of RLM use in 28 disciplines reveals uneven adoption and introduces a maturity assessment framework showing larger gaps when limited to public resources.

PortBench: A Correlation-Aware, Full-Pipeline Benchmark for LLM-Driven Portfolio Management

cs.AI · 2026-05-27 · unverdicted · none · ref 1 · internal anchor

PortBench supplies a static correlation QA set and a dynamic five-stage allocation pipeline, revealing that 90% of tested LLM configurations fail to beat equal-weight allocation and that constraint-satisfying models still incur large drawdowns under stress.

Towards Direct Evaluation of Harness Optimizers via Priority Ranking

cs.AI · 2026-05-21 · unverdicted · none · ref 69 · internal anchor

Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.

Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis

cs.AI · 2026-05-14 · unverdicted · none · ref 48 · internal anchor

EvoEnv lets a single policy synthesize, validate, and use Python environments with durable solve-verify asymmetry to improve reasoning performance on Qwen3-4B-Thinking from 72.4 to 74.8 while fixed-data baselines decline.

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

cs.AI · 2026-05-12 · unverdicted · none · ref 59 · internal anchor

ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

QuantClaw: Precision Where It Matters for OpenClaw

cs.AI · 2026-04-24 · unverdicted · none · ref 34 · internal anchor

QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.

Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents

cs.AI · 2026-04-07 · unverdicted · none · ref 53 · internal anchor

Claw-Eval is a new trajectory-aware benchmark for LLM agents that records execution traces, audit logs, and environment snapshots to evaluate completion, safety, and robustness across 300 tasks, revealing that opaque grading misses 44% of safety issues.

ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning

cs.AI · 2026-06-09 · unverdicted · none · ref 26 · internal anchor

ActiveMem proposes a heterogeneous distributed memory framework for LLM agents that separates planning from active memory management, reporting SOTA accuracy with lower overhead on BrowseComp-Plus and GAIA.

Inference Time Context Sparsity: Illusion or Opportunity?

cs.AI · 2026-05-22 · unverdicted · none · ref 55 · internal anchor

Current LLMs remain robust to high levels of inference-time context sparsity across diverse tasks, enabling up to 10x acceleration via sparse kernels.

Learning CLI Agents with Structured Action Credit under Selective Observation

cs.AI · 2026-05-08 · unverdicted · none · ref 11 · internal anchor

CLI agents trained with RL benefit from selective observation via σ-Reveal and structured credit assignment via A³ that leverages AST action sub-chains and trajectory margins.

MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models

cs.AI · 2026-04-18 · unverdicted · none · ref 21 · internal anchor

MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.

AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

cs.AI · 2026-04-07 · unverdicted · none · ref 23 · internal anchor

AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.

NebulaExp-8B: An Empirical Post-Training Pipeline via Full-Scale Ablation Research

cs.AI · 2026-06-25 · unverdicted · none · ref 32 · internal anchor

NebulaExp reports an empirical post-training pipeline on Qwen3-8B that raises instruct scores from 55.01 to 61.85 and reasoning scores from 73.88 to 75.17 via curated data, SFT, GRPO RL, and OPD/MOPD distillation.

A Formula-Driven Survey and Research Agenda for On-Policy Distillation

cs.AI · 2026-06-22 · unverdicted · none · ref 50 · internal anchor

A survey creates a taxonomy for on-policy distillation in LLMs that separates temporal credit assignment from vocabulary-level probability routing.

Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

cs.AI · 2026-05-07 · unverdicted · none · ref 96 · 2 links · internal anchor

Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

cs.AI · 2026-04-20 · unreviewed · ref 18 · internal anchor
