ABRA shows radiology agents excel at tool execution (89%+) but struggle with outcomes (0-25%), with oracle perception raising outcomes to 69-100%, identifying perception as the primary bottleneck.
Title resolution pending
36 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Containment verification proves that an agentic framework can enforce safety boundaries against any output from an unconstrained AI model by mechanized forward-simulation refinement in Dafny.
MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
WildTableBench is the first QA benchmark for naturally occurring table images, where 21 multimodal models were evaluated and only one exceeded 50% accuracy.
AutoMat benchmark shows current LLM coding agents achieve at most 54.1% success when reproducing computational materials science claims from papers.
SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.
An agentic AI workflow evolves an adaptive XGBoost quantile regression ensemble that reduces watershed-averaged forecast error by up to 29% versus California's operational forecasts for April-July runoff at 1-6 month leads across 23 Sierra Nevada sites.
MHGraphBench is a new PrimeKG-derived benchmark that exposes a recognition-to-judgment gap in 15 LLMs on mental health tasks while stressing that results measure KG agreement under constrained interfaces, not clinical capability.
TwinRouterBench supplies 970 execution-verified router prefixes across five datasets plus a live harness for 100 held-out SWE-bench cases, scoring routers on tier accuracy, trajectory success, and realized token cost without LLM judges.
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
CrackMeBench introduces 20 deterministic binary validation tasks and reports GPT-5.5 solving 11/12 generated ones at pass@3 while Claude and Kimi lag, especially on harder tasks.
MMM-Bench supplies 5,990 multi-modal documents from 12 commercial domains annotated along a 5-level taxonomy to test document classification under realistic business conditions.
FraudBench shows that current multimodal LLMs and specialized AI-image detectors often fail to spot AI-generated fake damage in refund evidence, with true positive rates frequently below 50% on synthetic subsets while producing false positives on real damage.
GPROF-IR is a CNN-based retrieval that uses temporal context in geostationary IR observations to produce precipitation estimates with lower error than prior IR methods and climatological consistency with PMW retrievals for integration into IMERG V08.
Entangled states of local dimension d enable strictly higher probability of agreeing on a common frequency band than the optimal classical strategy for sufficiently large safe bands d and spectrum size n, with an explicit 5.4% asymptotic advantage for d=2 using one Bell pair.
ClawNet digitizes human collaborative relationships into a network of identity-governed AI agents that collaborate on behalf of their owners through a central orchestrator enforcing binding and verification.
Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.
NeuroMAS reframes multi-agent language systems as neural architectures where LLM agents learn coordination via reinforcement learning rather than predefined roles.
ICRL uses joint RL training of solver and critic with distribution-calibration re-weighting and role-wise advantage estimation to internalize critique into unassisted LLM performance, yielding 6.4-point gains on agentic tasks and 7.0 on math reasoning with Qwen3 models.
Frontier AI models can detect evaluation settings and alter their behavior, so standard test scores do not reliably support safety conclusions.
Muon does not converge on convex Lipschitz functions regardless of learning rate, while error feedback restores theoretical convergence but degrades performance on CIFAR-10 and nanoGPT tasks.
GoSkills converts flat skill lists into role-labeled execution contexts via anchor-centered groups and graph expansion, preserving coverage and improving rewards on SkillsBench and ALFWorld under small skill budgets.
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
Twelve frontier AI safety frameworks score between 8% and 34% on adapted risk-management criteria, with a median of 18%, leaving them too vague to serve as reliable external accountability mechanisms.
citing papers explorer
-
ABRA: Agent Benchmark for Radiology Applications
ABRA shows radiology agents excel at tool execution (89%+) but struggle with outcomes (0-25%), with oracle perception raising outcomes to 69-100%, identifying perception as the primary bottleneck.
-
Containment Verification: AI Safety Guarantees Independent of Alignment
Containment verification proves that an agentic framework can enforce safety boundaries against any output from an unconstrained AI model by mechanized forward-simulation refinement in Dafny.
-
MedHorizon: Towards Long-context Medical Video Understanding in the Wild
MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
-
WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild
WildTableBench is the first QA benchmark for naturally occurring table images, where 21 multimodal models were evaluated and only one exceeded 50% accuracy.
-
Can Coding Agents Reproduce Findings in Computational Materials Science?
AutoMat benchmark shows current LLM coding agents achieve at most 54.1% success when reproducing computational materials science claims from papers.
-
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practical agents, and oracle knowledge.
-
Probabilistic Seasonal Streamflow Forecasting Across California's Sierra Nevada Watersheds with Agentic AI
An agentic AI workflow evolves an adaptive XGBoost quantile regression ensemble that reduces watershed-averaged forecast error by up to 29% versus California's operational forecasts for April-July runoff at 1-6 month leads across 23 Sierra Nevada sites.
-
MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models
MHGraphBench is a new PrimeKG-derived benchmark that exposes a recognition-to-judgment gap in 15 LLMs on mental health tasks while stressing that results measure KG agreement under constrained interfaces, not clinical capability.
-
TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing
TwinRouterBench supplies 970 execution-verified router prefixes across five datasets plus a live harness for 100 held-out SWE-bench cases, scoring routers on tier accuracy, trajectory success, and realized token cost without LLM judges.
-
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
-
CrackMeBench: Binary Reverse Engineering for Agents
CrackMeBench introduces 20 deterministic binary validation tasks and reports GPT-5.5 solving 11/12 generated ones at pass@3 while Claude and Kimi lag, especially on harder tasks.
-
Multi-domain Multi-modal Document Classification Benchmark with a Multi-level Taxonomy
MMM-Bench supplies 5,990 multi-modal documents from 12 commercial domains annotated along a 5-level taxonomy to test document classification under realistic business conditions.
-
FraudBench: A Multimodal Benchmark for Detecting AI-Generated Fraudulent Refund Evidence
FraudBench shows that current multimodal LLMs and specialized AI-image detectors often fail to spot AI-generated fake damage in refund evidence, with true positive rates frequently below 50% on synthetic subsets while producing false positives on real damage.
-
GPROF-IR: An Improved Single-Channel Infrared Precipitation Retrieval for Merged Satellite Precipitation Products
GPROF-IR is a CNN-based retrieval that uses temporal context in geostationary IR observations to produce precipitation estimates with lower error than prior IR methods and climatological consistency with PMW retrievals for integration into IMERG V08.
-
Quantum Advantage for Coordinated Frequency Selection Against Distributed Jammers
Entangled states of local dimension d enable strictly higher probability of agreeing on a common frequency band than the optimal classical strategy for sufficiently large safe bands d and spectrum size n, with an explicit 5.4% asymptotic advantage for d=2 using one Bell pair.
-
ClawNet: Human-Symbiotic Agent Network for Cross-User Autonomous Cooperation
ClawNet digitizes human collaborative relationships into a network of identity-governed AI agents that collaborate on behalf of their owners through a central orchestrator enforcing binding and verification.
-
Design and Report Benchmarks for Knowledge Work
Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.
-
NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning
NeuroMAS reframes multi-agent language systems as neural architectures where LLM agents learn coordination via reinforcement learning rather than predefined roles.
-
ICRL: Learning to Internalize Self-Critique with Reinforcement Learning
ICRL uses joint RL training of solver and critic with distribution-calibration re-weighting and role-wise advantage estimation to internalize critique into unassisted LLM performance, yielding 6.4-point gains on agentic tasks and 7.0 on math reasoning with Qwen3 models.
-
The Evaluation Differential: When Frontier AI Models Recognise They Are Being Tested
Frontier AI models can detect evaluation settings and alter their behavior, so standard test scores do not reliably support safety conclusions.
-
Muon Does Not Converge on Convex Lipschitz Functions
Muon does not converge on convex Lipschitz functions regardless of learning rate, while error feedback restores theoretical convergence but degrades performance on CIFAR-10 and nanoGPT tasks.
-
Group of Skills: Group-Structured Skill Retrieval for Agent Skill Libraries
GoSkills converts flat skill lists into role-labeled execution contexts via anchor-centered groups and graph expansion, preserving coverage and improving rewards on SkillsBench and ALFWorld under small skill budgets.
-
ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
-
Evaluating AI Providers' Frontier Safety Frameworks
Twelve frontier AI safety frameworks score between 8% and 34% on adapted risk-management criteria, with a median of 18%, leaving them too vague to serve as reliable external accountability mechanisms.
-
Infini-News: Efficiently Queryable Access to 1.3 Billion Processed Common Crawl News Articles
Infini-News delivers a cleaned, enriched, and efficiently queryable index over the full CC-News archive with language and country attribution for 1.35 billion articles.
-
A Joint Synthetic Housing-Household Inventory
A framework integrates synthetic population generation from ACS PUMS, deep contrastive learning for housing-household compatibility, and hierarchical optimization to produce a joint inventory that matches block-group demographics and spatial patterns in coastal North Carolina.
-
An Executable Benchmarking Suite for Tool-Using Agents
The paper delivers a unified executable benchmarking suite for tool-using agents that enforces a shared evidence-admission contract across web, code, and micro-task environments.
-
From Code-Centric to Intent-Centric Software Engineering: A Reflexive Thematic Analysis of Generative AI, Agentic Systems, and Engineering Accountability
Software engineering is transitioning from code-centric authorship to intent-centric supervision of human-agent systems, where specification, verification, security, and governance become central.
-
Evolutionary Ensemble of Agents
EvE co-evolves code solvers and guidance states via synchronous races and Elo updates, discovering a rescale-then-interpolate mechanism that enables example-count generalization in ICON.
-
Authorization Propagation in Multi-Agent AI Systems: Identity Governance as Infrastructure
Multi-agent AI creates an authorization propagation problem not solved by prompt injection defenses or classical access control, requiring identity governance as continuously enforced infrastructure.
-
The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents
Triadic data—synchronized human-human conversations, human-AI sessions, and cross-functional team work—is the essential substrate for training long-horizon software engineering agents.
-
Claw AI Lab: An Autonomous Multi-Agent Research Team
Claw AI Lab presents an interactive multi-agent platform for autonomous AI research that supports customizable teams, real-time control, and a code harness for experiment integration and result integrity.
-
SparseRL-Sync: Lossless Weight Synchronization with ~100x Less Communication
SparseRL-Sync achieves lossless weight synchronization in large-scale RL by sending only changed parameters, reducing communication volume by roughly 100x under observed 99%+ element-level sparsity.
- Bounding LVR in AMMs via Secant-Tangent Divergence and Collateralized Liquidity Scaling
- Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches
- Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation