hub

GLM-5: from Vibe Coding to Agentic Engineering

GLM-5-Team: Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen · 2026 · cs.LG · arXiv 2602.15763

60 Pith papers cite this work. Polarity classification is still indexing.

60 Pith papers citing it

open full Pith review browse 60 citing papers arXiv PDF

abstract

We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that further improve RL quality, enabling the model to learn from complex, long-horizon interactions more effectively. Through these innovations, GLM-5 achieves state-of-the-art performance on major open benchmarks. Most critically, GLM-5 demonstrates unprecedented capability in real-world coding tasks, surpassing previous baselines in handling end-to-end software engineering challenges. Code, models, and more information are available at https://github.com/zai-org/GLM-5.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1

citation-polarity summary

background 1

claims ledger

abstract We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that fur

co-cited works

representative citing papers

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

cs.CL · 2026-05-11 · unverdicted · novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

cs.AI · 2026-05-11 · unverdicted · novelty 8.0

Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

cs.CL · 2026-04-13 · unverdicted · novelty 8.0

OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.

AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.

StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs

cs.CY · 2026-05-11 · accept · novelty 7.0 · 2 refs

StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.

Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.

Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning

cs.CL · 2026-05-09 · unverdicted · novelty 7.0

OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.

KL for a KL: On-Policy Distillation with Control Variate Baseline

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.

MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

cs.LG · 2026-05-08 · conditional · novelty 7.0

MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.

Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution

cs.SE · 2026-05-07 · unverdicted · novelty 7.0

TEBench is a new project-level benchmark for test evolution showing coding agents achieve only 45-49% F1 on identifying tests needing changes, with stale tests hardest due to reliance on execution failures.

MolViBench: Evaluating LLMs on Molecular Vibe Coding

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

MolViBench is the first benchmark designed to evaluate LLMs on generating executable programs for molecular tasks in drug discovery.

When Alignment Isn't Enough: Response-Path Attacks on LLM Agents

cs.CR · 2026-05-04 · unverdicted · novelty 7.0

A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.

MathDuels: Evaluating LLMs as Problem Posers and Solvers

cs.CL · 2026-04-23 · unverdicted · novelty 7.0

Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.

Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs

cs.SE · 2026-04-19 · unverdicted · novelty 7.0

MultiLogBench shows that LLM performance on automated logging varies substantially across programming languages, demonstrating that single-language evidence is insufficient for general claims about model behavior or tool design.

BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.

ClawBench: Can AI Agents Complete Everyday Online Tasks?

cs.CL · 2026-04-09 · unverdicted · novelty 7.0

ClawBench is a benchmark of 153 live-web tasks where AI agents achieve low success rates, e.g. 33.3% for Claude Sonnet 4.6.

AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

cs.AI · 2026-04-03 · unverdicted · novelty 7.0

AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.

DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72

cs.DC · 2026-04-02 · unverdicted · novelty 7.0

DWDP distributes MoE weights across GPUs for independent execution without collective synchronization, improving output TPS/GPU by 8.8 percent on GB200 NVL72 for DeepSeek-R1 under 8K input and 1K output lengths.

MinT: Managed Infrastructure for Training and Serving Millions of LLMs

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

MinT enables efficient management of million-scale LoRA-adapted LLM policies over shared 1T-parameter base models by moving only small adapters through training and serving pipelines.

Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.

SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle

cs.SE · 2026-05-13 · unverdicted · novelty 6.0

SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

citing papers explorer

Showing 50 of 60 citing papers.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence cs.CL · 2026-05-13 · accept · none · ref 50 · internal anchor
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation cs.CL · 2026-05-11 · unverdicted · none · ref 52 · internal anchor
A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values cs.AI · 2026-05-11 · unverdicted · none · ref 72 · internal anchor
Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds cs.LG · 2026-05-07 · unverdicted · none · ref 44 · internal anchor
SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation cs.CL · 2026-04-13 · unverdicted · none · ref 6 · internal anchor
OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents cs.LG · 2026-05-11 · unverdicted · none · ref 85 · internal anchor
AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs cs.CY · 2026-05-11 · accept · none · ref 44 · 2 links · internal anchor
StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.
Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness cs.CL · 2026-05-11 · unverdicted · none · ref 56 · internal anchor
LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.
Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning cs.CL · 2026-05-09 · unverdicted · none · ref 40 · internal anchor
OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.
KL for a KL: On-Policy Distillation with Control Variate Baseline cs.LG · 2026-05-08 · unverdicted · none · ref 48 · internal anchor
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference cs.LG · 2026-05-08 · conditional · none · ref 31 · internal anchor
MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution cs.SE · 2026-05-07 · unverdicted · none · ref 5 · internal anchor
TEBench is a new project-level benchmark for test evolution showing coding agents achieve only 45-49% F1 on identifying tests needing changes, with stale tests hardest due to reliance on execution failures.
MolViBench: Evaluating LLMs on Molecular Vibe Coding cs.CL · 2026-05-04 · unverdicted · none · ref 29 · internal anchor
MolViBench is the first benchmark designed to evaluate LLMs on generating executable programs for molecular tasks in drug discovery.
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents cs.CR · 2026-05-04 · unverdicted · none · ref 66 · internal anchor
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
MathDuels: Evaluating LLMs as Problem Posers and Solvers cs.CL · 2026-04-23 · unverdicted · none · ref 14 · internal anchor
Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.
Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs cs.SE · 2026-04-19 · unverdicted · none · ref 20 · internal anchor
MultiLogBench shows that LLM performance on automated logging varies substantially across programming languages, demonstrating that single-language evidence is insufficient for general claims about model behavior or tool design.
BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation cs.CV · 2026-04-15 · unverdicted · none · ref 28 · internal anchor
BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.
ClawBench: Can AI Agents Complete Everyday Online Tasks? cs.CL · 2026-04-09 · unverdicted · none · ref 11 · internal anchor
ClawBench is a benchmark of 153 live-web tasks where AI agents achieve low success rates, e.g. 33.3% for Claude Sonnet 4.6.
AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents cs.AI · 2026-04-03 · unverdicted · none · ref 31 · internal anchor
AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.
DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72 cs.DC · 2026-04-02 · unverdicted · none · ref 8 · internal anchor
DWDP distributes MoE weights across GPUs for independent execution without collective synchronization, improving output TPS/GPU by 8.8 percent on GB200 NVL72 for DeepSeek-R1 under 8K input and 1K output lengths.
MinT: Managed Infrastructure for Training and Serving Millions of LLMs cs.LG · 2026-05-13 · unverdicted · none · ref 7 · internal anchor
MinT enables efficient management of million-scale LoRA-adapted LLM policies over shared 1T-parameter base models by moving only small adapters through training and serving pipelines.
Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation cs.CL · 2026-05-13 · unverdicted · none · ref 41 · internal anchor
Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle cs.SE · 2026-05-13 · unverdicted · none · ref 13 · internal anchor
SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents cs.AI · 2026-05-12 · unverdicted · none · ref 59 · internal anchor
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction cs.LG · 2026-05-12 · unverdicted · none · ref 32 · internal anchor
Missing old logits in async agentic RL entangle discrepancy and staleness terms in PPO off-policy correction; exact acquisition methods and revised PPO-EWMA restore decoupled updates with reported gains in speed and performance.
ProteinOPD: Towards Effective and Efficient Preference Alignment for Protein Design cs.LG · 2026-05-11 · unverdicted · none · ref 39 · internal anchor
ProteinOPD uses token-level on-policy distillation from multiple preference-specific teacher models into a shared student to balance competing objectives in protein design, delivering gains on targets without losing designability and an 8x speedup over RL baselines.
Flow-OPD: On-Policy Distillation for Flow Matching Models cs.CV · 2026-05-08 · unverdicted · none · ref 19 · 2 links · internal anchor
Flow-OPD applies on-policy distillation to flow matching models, achieving GenEval of 92 and OCR accuracy of 94 on Stable Diffusion 3.5 Medium while avoiding the seesaw effect of multi-reward optimization.
LoopTrap: Termination Poisoning Attacks on LLM Agents cs.CR · 2026-05-07 · unverdicted · none · ref 9 · internal anchor
LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.
Evaluation Awareness in Language Models Has Limited Effect on Behaviour cs.CL · 2026-05-07 · conditional · none · ref 10 · internal anchor
Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe cs.LG · 2026-05-05 · unverdicted · none · ref 49 · internal anchor
Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.
Affordance Agent Harness: Verification-Gated Skill Orchestration cs.RO · 2026-05-01 · unverdicted · none · ref 82 · 2 links · internal anchor
Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tradeoffs in open-world affordance grounding.
Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding cs.CL · 2026-05-01 · unverdicted · none · ref 13 · internal anchor
EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.
Co-Evolving Policy Distillation cs.LG · 2026-04-29 · unverdicted · none · ref 13 · internal anchor
CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific experts on text-image-video reasoning.
MAIC-UI: Making Interactive Courseware with Generative UI cs.CL · 2026-04-28 · unverdicted · none · ref 58 · internal anchor
MAIC-UI provides a zero-code authoring system for generating and iteratively editing interactive courseware from educational materials via structured analysis and incremental generation, with lab and classroom evaluations showing usability gains and learning improvements.
QuantClaw: Precision Where It Matters for OpenClaw cs.AI · 2026-04-24 · unverdicted · none · ref 34 · internal anchor
QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.
Temporally Extended Mixture-of-Experts Models cs.LG · 2026-04-22 · unverdicted · none · ref 14 · internal anchor
Temporally extended MoE layers using the option-critic framework with deliberation costs cut switching rates below 5% while retaining most capability on MATH, MMLU, and MMMLU.
AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards cs.CV · 2026-04-21 · unverdicted · none · ref 35 · internal anchor
AeSlides is a GRPO-based RL framework that uses verifiable aesthetic metrics to optimize LLM slide generation, achieving large gains in layout quality metrics and human scores with only 5K prompts.
ClawEnvKit: Automatic Environment Generation for Claw-Like Agents cs.AI · 2026-04-20 · unverdicted · none · ref 18 · internal anchor
ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lower cost.
Toward Autonomous Long-Horizon Engineering for ML Research cs.CL · 2026-04-14 · unverdicted · none · ref 27 · internal anchor
AiScientist improves ML research benchmarks by 10.54 points on PaperBench and reaches 81.82% Any Medal on MLE-Bench Lite through hierarchical control plus durable file-based state instead of conversational handoffs.
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe cs.LG · 2026-04-14 · unverdicted · none · ref 27 · internal anchor
On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.
ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection cs.CR · 2026-04-13 · unverdicted · none · ref 44 · 2 links · internal anchor
ClawGuard enforces deterministic, user-derived access constraints at tool boundaries to block indirect prompt injection without changing the underlying LLM.
Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents cs.AI · 2026-04-07 · unverdicted · none · ref 53 · internal anchor
Claw-Eval is a new trajectory-aware benchmark for LLM agents that records execution traces, audit logs, and environment snapshots to evaluate completion, safety, and robustness across 300 tasks, revealing that opaque grading misses 44% of safety issues.
InCoder-32B-Thinking: Industrial Code World Model for Thinking cs.AR · 2026-04-03 · unverdicted · none · ref 47 · internal anchor
InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.
An Independent Safety Evaluation of Kimi K2.5 cs.CR · 2026-04-03 · conditional · none · ref 3 · internal anchor
Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents cs.SE · 2026-04-02 · unverdicted · none · ref 9 · internal anchor
A two-stage SFT pipeline distills execution-free then execution-based trajectories from a 480B model into smaller Qwen2.5-Coder agents, yielding 62.2% resolution on SWE-bench Verified and 44.1% zero-shot on the multilingual version.
UserGPT Technical Report cs.IR · 2026-05-09 · unverdicted · none · ref 63 · internal anchor
UserGPT introduces a generative LLM framework with a behavior simulation engine, semantization module, and DF-GRPO post-training that scores 0.7325 on tag prediction and 0.7528 on summary generation on HPR-Bench while compressing records by up to 97.9%.
Learning CLI Agents with Structured Action Credit under Selective Observation cs.AI · 2026-05-08 · unverdicted · none · ref 11 · internal anchor
CLI agents trained with RL benefit from selective observation via σ-Reveal and structured credit assignment via A³ that leverages AST action sub-chains and trajectory margins.
PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents cs.LG · 2026-05-07 · unverdicted · none · ref 51 · internal anchor
PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing, recommendation, and protein tasks.
Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving cs.DC · 2026-05-07 · unverdicted · none · ref 12 · internal anchor
Irminsul recovers up to 83% of prompt tokens above exact-prefix matching and delivers 63% prefill energy savings per cache hit on MLA-MoE models by content-hashing CDC chunks and applying closed-form kr correction.
GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents cs.CV · 2026-04-29 · unverdicted · none · ref 50 · 3 links · internal anchor
GLM-5V-Turbo integrates multimodal perception as a core part of reasoning and execution for agentic tasks, reporting strong results in visual tool use and multimodal coding while keeping text-only performance competitive.

GLM-5: from Vibe Coding to Agentic Engineering

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer