{"total":13,"items":[{"citing_arxiv_id":"2605.22875","ref_index":52,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"RMA: an Agentic System for Research-Level Mathematical Problems","primary_cat":"cs.AI","submitted_at":"2026-05-20T04:54:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RMA, a multi-agent system with structured memory and iterative feedback loops, solves 8 out of 10 research-level math problems on the new First Proof benchmark and outperforms GPT-5.2R and Aletheia according to expert evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19035","ref_index":51,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Trustworthy Agent Network: Trust in Agent Networks Must Be Baked In, Not Bolted On","primary_cat":"cs.AI","submitted_at":"2026-05-18T18:57:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Argues that trustworthiness in Agent-to-Agent networks requires a new conceptual framework with four design pillars baked in from the beginning, as retrofitting existing single-agent methods is insufficient.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17169","ref_index":58,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Responsible Agentic AI Requires Explicit Provenance","primary_cat":"cs.AI","submitted_at":"2026-05-16T21:56:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Explicit provenance across the full agentic AI lifecycle is the necessary condition for making responsibility computable and 
actionable.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11853","ref_index":1,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-12T09:38:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GEAR adaptively reweights GRPO advantages in LLM RL by using divergence spikes from self-distillation to define semantic segments and modulate local credit.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The gains are especially strong on benchmarks with lower GRPO baseline accuracy, reaching up to around 20% over GRPO, suggesting that the proposed adaptive reweighting scheme is especially useful in more challenging long-horizon settings. 1 Introduction Large language models (LLMs) are increasingly deployed as agents for complex, multi-step tasks [1, 2, 3]. These agents typically operate through multi-turn interactions with external environments, interleaving reasoning with tool use such as retrieval or code execution [4]. 
In such settings, correctness is often determined only at the end of an interaction through a verifiable outcome reward, making supervision naturally trajectory-level [5, 6]."},{"citing_arxiv_id":"2605.15215","ref_index":21,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SkillSmith: Compiling Agent Skills into Boundary-Guided Runtime Interfaces","primary_cat":"cs.AI","submitted_at":"2026-05-12T09:25:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SkillSmith is a boundary-first compiler-runtime system that turns skill packages into minimal executable interfaces, cutting token usage 57%, thinking iterations 43%, and solve time 51% versus raw skill injection on SkillsBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10057","ref_index":20,"ref_count":3,"confidence":0.55,"is_internal_anchor":false,"paper_title":"STAR: Failure-Aware Markovian Routing for Multi-Agent Spatiotemporal Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-11T06:34:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"STAR presents a failure-aware routing framework using a state-conditioned transition policy and an agent routing matrix combining expert routes with learned recoveries from execution traces to improve multi-agent spatiotemporal reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"recovery analyses demonstrate that typed failure-aware routing improves robustness especially on executions that deviate from the nominal path. 2 2 Related Work Tool-augmented reasoning and self-correction. Tool-augmented language models extend LLMs with access to external tools, APIs, code execution, and symbolic or numerical solvers during reasoning [20, 31, 12, 16, 4]. 
These systems improve performance on tasks requiring computation, retrieval, planning, or environment interaction. Related work on self-correction and reflection further improves reasoning by asking models to critique, revise, or retry unsuccessful outputs [18, 13, 23]. However, tool choice and recovery are often still handled through free-form generation, textual"},{"citing_arxiv_id":"2605.09806","ref_index":7,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-10T23:05:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning outputs than base models on math benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 17682-17690, 2024. [6] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022. [7] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539-68551, 2023. 
[8] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec"},{"citing_arxiv_id":"2605.09330","ref_index":33,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory","primary_cat":"cs.LG","submitted_at":"2026-05-10T05:04:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agentic memory improves clean reasoning but worsens performance when spurious patterns are present in stored trajectories; CAMEL calibration reduces this reliance while preserving clean performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"target the calibration mechanism, demonstrating that representation-level calibration can address spurious correlations in agentic memory in a principled and practical way. Code is released at https://anonymous.4open.science/r/Spurious_Correlation-A830. 2 Related Work LLM-as-agent. The deployment of LLMs as autonomous agents has extended their role beyond text generation toward reasoning, planning, and acting in complex environments [33, 58, 60], with multi-agent architectures further enabling collaboration and debate among specialized agents [5, 53]. 
Building on this paradigm, recent work has applied agentic LLMs to high-stakes domains, including clinical diagnostics and digital health [23, 24, 65], cybersecurity threat intelligence and blue-teaming [17, 22, 45], and telecom resilience via causal digital twins [41]."},{"citing_arxiv_id":"2605.08715","ref_index":44,"ref_count":4,"confidence":0.55,"is_internal_anchor":false,"paper_title":"AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems","primary_cat":"cs.CL","submitted_at":"2026-05-09T05:55:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.","context_count":2,"top_context_role":"method","top_context_polarity":"use_method","context_text":"gradient signal during the early phase before the policy learns the schema. We optimize R via GRPO, applying two adaptations specific to our coarse-to-fine setup: (i) we anchor the reference policy πref at the Stage 1 BPPO checkpoint πθ1 so that the KL regularizer pulls πθ back toward the risk-anticipation prior learned in Stage 1; (ii) we estimate the KL divergence with the low-variance k3 estimator D̂_KL(πθ ∥ πref) [44], which is non-negative by construction and reduces gradient noise on long-trajectory rollouts. 
With these adaptations, the RL objective is formulated as: L_GRPO(θ) = −E[min(ρ_{j,t}(θ) A_j, clip(ρ_{j,t}(θ), 1−ϵ, 1+ϵ) A_j)] + β_KL D̂_KL(πθ ∥ πθ1), (12) with token-level importance ratio ρ_{j,t}(θ) and πref anchored at πθ1 to prevent drift from the risk-anticipation prior."},{"citing_arxiv_id":"2604.24594","ref_index":29,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Skill Retrieval Augmentation for Agentic AI","primary_cat":"cs.CL","submitted_at":"2026-04-27T15:19:59+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20398","ref_index":36,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning","primary_cat":"cs.CL","submitted_at":"2026-04-22T10:04:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WebGen-R1 uses end-to-end RL with scaffold-driven generation and cascaded rewards for structure, function, and aesthetics to transform a 7B model into a generator of deployable multi-page websites that rivals much larger models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03088","ref_index":52,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SkVM: Revisiting Language VM for Skills across Heterogenous LLMs and Harnesses","primary_cat":"cs.SE","submitted_at":"2026-04-03T15:11:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SkVM uses capability profiling and compiler-style techniques to make skills portable across LLMs and harnesses, raising task completion rates while 
cutting token use by up to 40% and delivering up to 3.2x speedup.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"ACM SIGPLAN Notices, 53(8):50-63, 2018. [49] Manuel Serrano. Of javascript aot compilation performance. Proceedings of the ACM on Programming Languages, 5(ICFP):1-30, 2021. [50] Skills.sh. Skills.sh: Agent skills registry. https://skills.sh, 2025. Accessed: 2026-03-22. [51] Bjarne Stroustrup. The C++ programming language. Pearson Education, 2013. [52] Toshio Suganuma, Takeshi Ogasawara, Mikio Takeuchi, Toshiaki Yasue, Motohiro Kawahito, Kazuaki Ishizaki, Hideaki Komatsu, and Toshio Nakatani. Overview of the ibm java just-in-time compiler. IBM systems Journal, 39(1):175-193, 2000. [53] Bill Venners. Inside the Java Virtual Machine. McGraw-Hill, 1998. [54] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen"},{"citing_arxiv_id":"2604.02678","ref_index":20,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Eligibility-Aware Evidence Synthesis: An Agentic Framework for Clinical Trial Meta-Analysis","primary_cat":"stat.ME","submitted_at":"2026-04-03T03:18:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EligMeta automates trial discovery from registries and incorporates eligibility similarity into meta-analysis weighting to yield population-aligned pooled estimates, as shown by recovering all guideline trials in one case and shifting a risk ratio in another.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tic systems: balancing the flexibility of LLMs with the reproducibility requirements of statistical analysis. 2 Methods EligMeta transforms free-text clinical queries into cohort-specific, reproducible meta-analytic estimates through a structured, multi-stage pipeline. 
Current agentic systems face challenges in balancing flexibility with reproducibility. Pre-defined function-calling approaches [20] ensure consistency but cannot flexibly handle the heterogeneity and complexity of clinical trial registries, while end-to-end agentic coding systems [21, 22] offer adaptability but at the cost of reproducibility and computational efficiency. EligMeta addresses these limitations through a hybrid architecture that combines LLM-based reasoning with deterministic execution of numerically critical operations."}],"limit":50,"offset":0}