{"total":12,"items":[{"citing_arxiv_id":"2605.22219","ref_index":42,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval","primary_cat":"cs.AI","submitted_at":"2026-05-21T09:22:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SGR-Bench evaluates agentic LLM systems on state-gated retrieval tasks where evidence is only accessible after configuring site-specific states, with the strongest system reaching 66.18% item-level F1 and failures dominated by retrieval-scope drift.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17734","ref_index":20,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Harnessing LLM Agents with Skill Programs","primary_cat":"cs.AI","submitted_at":"2026-05-18T01:35:11+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HASP upgrades textual skills into executable Program Functions that intervene in LLM agent loops at inference, post-training, or self-evolution, delivering 25% gains over ReAct and 30.4% over Search-R1 on reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15508","ref_index":35,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"STS: Efficient Sparse Attention with Speculative Token Sparsity","primary_cat":"cs.LG","submitted_at":"2026-05-15T01:05:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"STS repurposes draft-model attention scores from speculative decoding to build token-and-head-wise sparsity masks, delivering 2.67x speedup at ~90% sparsity on NarrativeQA with negligible accuracy 
loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11853","ref_index":4,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-12T09:38:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GEAR adaptively reweights GRPO advantages in LLM RL by using divergence spikes from self-distillation to define semantic segments and modulate local credit.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"scheme is especially useful in more challenging long-horizon settings. 1 Introduction Large language models (LLMs) are increasingly deployed as agents for complex, multi-step tasks [1, 2, 3]. These agents typically operate through multi-turn interactions with external environments, interleaving reasoning with tool use such as retrieval or code execution [4]. In such settings, correctness is often determined only at the end of an interaction through a verifiable outcome reward, making supervision naturally trajectory-level [5, 6]. As a result, outcome-based reinforcement learning (RL) has become a standard approach for improving LLM agents [7]. 
In particular, group-based methods such as Group Relative Policy Optimization (GRPO) [8] are commonly used to"},{"citing_arxiv_id":"2605.12532","ref_index":13,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"AgenticAITA: A Proof-Of-Concept About Deliberative Multi-Agent Reasoning for Autonomous Trading Systems","primary_cat":"q-fin.TR","submitted_at":"2026-05-01T16:25:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"AgenticAITA proposes a training-free multi-agent LLM framework for autonomous trading using a deliberative pipeline, Z-score triggers, and safety gates, shown to run correctly in a five-day live dry-run with 157 invocations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22513","ref_index":46,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Benchmarking LLM-Driven Network Configuration Repair","primary_cat":"cs.NI","submitted_at":"2026-04-24T12:53:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Cornetto is the first benchmark that synthesizes 231 network misconfiguration problems across topologies of 20-754 nodes and uses formal verification to show that nine state-of-the-art LLMs often introduce regressions and degrade at scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11041","ref_index":51,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"From Topology to Trajectory: LLM-Driven World Models For Supply Chain Resilience","primary_cat":"cs.AI","submitted_at":"2026-04-13T06:14:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ReflectiChain uses latent trajectory rehearsal and retrospective agentic RL 
inside an LLM world model to raise average step rewards by 250% and restore supply-chain operability from 13.3% to 88.5% on the Semi-Sim benchmark under extreme shocks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06618","ref_index":22,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"PoC-Adapt: Semantic-Aware Automated Vulnerability Reproduction with LLM Multi-Agents and Reinforcement Learning-Driven Adaptive Policy","primary_cat":"cs.CR","submitted_at":"2026-04-08T02:59:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PoC-Adapt improves automated PoC exploit generation reliability by 25% and lowers cost using semantic state validation and RL adaptive policies, verifying 12 PoCs from 80 recent CVE attempts at $0.42 each.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06268","ref_index":65,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"RAGEN-2: Reasoning Collapse in Agentic RL","primary_cat":"cs.LG","submitted_at":"2026-04-07T04:29:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Template collapse is a distinct failure mode in agentic RL invisible to entropy; mutual information proxies diagnose it better and SNR-aware filtering using reward variance improves input-dependent reasoning and task performance across planning, math, navigation, and code tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.23806","ref_index":39,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Willful Disobedience: Automatically Detecting Failures in Agentic 
Traces","primary_cat":"cs.SE","submitted_at":"2026-03-25T00:33:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AgentPex extracts rules from prompts and automatically flags specification violations in agent execution traces that outcome-only benchmarks miss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06205","ref_index":13,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Tool-MCoT: Tool Augmented Multimodal Chain-of-Thought for Content Safety Moderation","primary_cat":"cs.CL","submitted_at":"2026-03-15T19:08:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A small language model fine-tuned on tool-augmented chain-of-thought data generated by a larger LLM learns to selectively call tools, delivering better content moderation accuracy at lower inference cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.06850","ref_index":35,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"The Dark Side of LLMs: Agent-based Attack Vectors for System-level Compromise","primary_cat":"cs.CR","submitted_at":"2025-07-09T13:54:58+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Testing 18 LLMs found 94.4% vulnerable to direct prompt injection for malware installation, 83.3% to RAG backdoor attacks, and 100% to inter-agent trust exploitation in multi-agent systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}