CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.
hub Mixed citations
GLM-5: from Vibe Coding to Agentic Engineering
Mixed citation behavior. Most common role is background (68%).
abstract
We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that further improve RL quality, enabling the model to learn from complex, long-horizon interactions more effectively. Through these innovations, GLM-5 achieves state-of-the-art performance on major open benchmarks. Most critically, GLM-5 demonstrates unprecedented capability in real-world coding tasks, surpassing previous baselines in handling end-to-end software engineering challenges. Code, models, and more information are available at https://github.com/zai-org/GLM-5.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that fur
co-cited works
years
2026 82representative citing papers
A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.
AMUSE is a new optimizer integrating Muon orthogonalization with Schedule-Free averaging via adaptive interpolation for schedule-free anytime training that improves Pareto frontiers on vision and LLM tasks.
Terminal-World is a skill-based synthesis pipeline that generates 5,723 training environments and produces Terminal-World-32B which outperforms baselines on Terminal-Bench 2.0 using only 1.2% of the data.
WebGameBench is a new benchmark that evaluates coding agents on building browser-native games from frozen specifications, with runtime browser evaluation showing best agents reach 76.9% usable rate but only 20.2% excellent rate.
Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasoning accuracy and shortening responses.
AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.
StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.
OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.
MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
TEBench is a new project-level benchmark for test evolution showing coding agents achieve only 45-49% F1 on identifying tests needing changes, with stale tests hardest due to reliance on execution failures.
MolViBench is the first benchmark designed to evaluate LLMs on generating executable programs for molecular tasks in drug discovery.
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.
MultiLogBench shows that LLM performance on automated logging varies substantially across programming languages, demonstrating that single-language evidence is insufficient for general claims about model behavior or tool design.
BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.
SkillMOO applies LLM-proposed edits and NSGA-II Pareto optimization to skill bundles for SE agents, ranking top in pass rate on most SkillsBench tasks while cutting costs up to 31.7%.
ClawBench is a benchmark of 153 live-web tasks where AI agents achieve low success rates, e.g. 33.3% for Claude Sonnet 4.6.
AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.
DWDP distributes MoE weights across GPUs for independent execution without collective synchronization, improving output TPS/GPU by 8.8 percent on GB200 NVL72 for DeepSeek-R1 under 8K input and 1K output lengths.
citing papers explorer
-
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
-
WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games
WebGameBench is a new benchmark that evaluates coding agents on building browser-native games from frozen specifications, with runtime browser evaluation showing best agents reach 76.9% usable rate but only 20.2% excellent rate.
-
AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.
-
PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments
PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.
-
Towards Direct Evaluation of Harness Optimizers via Priority Ranking
Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.
-
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
-
QuantClaw: Precision Where It Matters for OpenClaw
QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.
-
Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents
Claw-Eval is a new trajectory-aware benchmark for LLM agents that records execution traces, audit logs, and environment snapshots to evaluate completion, safety, and robustness across 300 tasks, revealing that opaque grading misses 44% of safety issues.
-
ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning
ActiveMem proposes a heterogeneous distributed memory framework for LLM agents that separates planning from active memory management, reporting SOTA accuracy with lower overhead on BrowseComp-Plus and GAIA.
-
Learning CLI Agents with Structured Action Credit under Selective Observation
CLI agents trained with RL benefit from selective observation via σ-Reveal and structured credit assignment via A³ that leverages AST action sub-chains and trajectory margins.
-
MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models
MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.
-
AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.
-
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.
- ClawEnvKit: Automatic Environment Generation for Claw-Like Agents