{"total":10,"items":[{"citing_arxiv_id":"2605.23330","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Security, Privacy, and Ethical Risks in OpenClaw","primary_cat":"cs.CR","submitted_at":"2026-05-22T07:45:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper analyzes security, privacy, and ethical risks in the OpenClaw AI agent system arising from its architecture, storage, tool use, and integrations, arguing these form major barriers to trustworthy adoption.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21694","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PocketAgents: A Manifest-Driven Library of Autonomous Defense Agents","primary_cat":"cs.CR","submitted_at":"2026-05-20T19:52:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PocketAgents introduces a manifest-driven library for LLM-based autonomous defense agents, evaluated in 18 closed-loop trials against a DarkSide-inspired attack where 13 trials produced validated blocking actions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20520","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Open-World Evaluations for Measuring Frontier AI Capabilities","primary_cat":"cs.AI","submitted_at":"2026-05-19T21:42:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Open-world evaluations using qualitative review of real-world tasks can give earlier warnings of frontier AI capabilities than automated benchmarks, as demonstrated by an AI agent publishing a simple iOS app with one minor human 
fix.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19149","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents","primary_cat":"cs.CL","submitted_at":"2026-05-18T22:03:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12078","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Property-Level Reconstructability of Agent Decisions: An Anchor-Level Pilot Across Vendor SDK Adapter Regimes","primary_cat":"cs.SE","submitted_at":"2026-05-12T13:05:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pilot study shows agent decision reconstructability varies by vendor SDK regime, with completeness scores from 42.9% to 85.7% and consistent gaps in reasoning traces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11665","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Nautilus: From One Prompt to Plug-and-Play Robot Learning","primary_cat":"cs.RO","submitted_at":"2026-05-12T07:26:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning 
research.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Tengjun Jin, Yifan Mai, Yifei Zhou, Yuxuan Zhu, Rishi Bommasani, Daniel Kang, Dawn Song, Peter Henderson, Yu Su, Percy Liang, and Arvind Narayanan. Holistic agent leaderboard: The missing infrastructure for AI agent evaluation. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=vUaY1t64ZZ. [26] Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, and Arvind Narayanan. Towards a science of ai agent reliability, 2026. URL https://arxiv.org/abs/2602.16666. [27] Leon Staufer, Kevin Feng, Kevin Wei, Luke Bailey, Yawen Duan, Mick Yang, A. Pinar Ozisik, Stephen Casper, and Noam Kolt. The 2025 ai agent index: Documenting technical and safety"},{"citing_arxiv_id":"2605.10516","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability","primary_cat":"cs.AI","submitted_at":"2026-05-11T13:06:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing failures.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"AI agents powered by large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, from software engineering [9] to complex reasoning tasks [33]. As these agents are increasingly deployed in real-world applications, ensuring their consistent and reliable behavior across different contexts and input variations has proven challenging [18]. 
Unlike traditional software systems with deterministic behavior, LLM-based agents can exhibit significant output variance when faced with semantically equivalent inputs, due to sampling stochasticity and sensitivity to superficial variations in prompt formulation [15, 18, 20]. Agent evaluations represent a nascent but rapidly growing field, yet current evaluation methodolo-"},{"citing_arxiv_id":"2605.01428","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Hallucinations Undermine Trust; Metacognition is a Way Forward","primary_cat":"cs.CL","submitted_at":"2026-05-02T12:59:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs need metacognition to align expressed uncertainty with their actual knowledge boundaries, moving beyond knowledge expansion to reduce confident errors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23897","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MarketBench: Evaluating AI Agents as Market Participants","primary_cat":"cs.AI","submitted_at":"2026-04-26T21:48:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs show poor calibration in predicting task success and token use on software engineering benchmarks, causing market auctions to underperform compared to perfect information scenarios, with limited improvement from added context.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17243","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RemoteShield: Enable Robust Multimodal Large Language Models for Earth 
Observation","primary_cat":"cs.CV","submitted_at":"2026-04-19T04:04:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RemoteShield improves robustness of Earth observation MLLMs by training on semantic equivalence clusters of clean and perturbed inputs via preference learning to maintain consistent reasoning under noise.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ity, showing that multimodal predictions may change even under semantically equivalent input formulations [1, 29]. A smaller body of work has begun to study robustness through consistency and stable behavior across perturbed conditions, rather than accuracy alone, and to explore optimization strategies for improving invariance under semantic variation [41, 59]. However, these efforts are still centered on general-domain multimodal models. In contrast, EO MLLMs remain rarely studied under realistic coupled image-text perturbations, especially from the perspective of cross-condition consistency and behavioral stability. 3. Robustness Evaluation Clean benchmark performance alone does not determine whether an RS-MLLM remains reliable when the same EO"}],"limit":50,"offset":0}