{"total":23,"items":[{"citing_arxiv_id":"2605.23559","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PathNavigate: A Training-Free Pathology Agent with Surprise-Guided Scan and Shared Slide Memory for Whole-Slide Image VQA","primary_cat":"cs.CV","submitted_at":"2026-05-22T12:25:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PathNavigate introduces a scan-search-readout routine with surprise-guided low-mag scanning and shared slide memory to improve training-free WSI-VQA accuracy and efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23262","ref_index":71,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Design and Report Benchmarks for Knowledge Work","primary_cat":"cs.AI","submitted_at":"2026-05-22T06:03:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23204","ref_index":168,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery","primary_cat":"cs.AI","submitted_at":"2026-05-22T03:40:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey organizing AI-powered research automation into five workflow stages, defining AutoResearch and Vibe Research, and proposing five evaluation dimensions while noting domain-conditioned limits on 
autonomy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"from randomized trials, observational studies, diagnostic studies, and real-world data remains difficult to integrate; regulatory-grade provenance and uncertainty propagation remain uneven. Medical LLM competence on static QA or benchmark-style reasoning does not establish clinical autonomy, and simulated clinical-agent benchmarks expose the additional difficulty of interactive diagnosis, uncertainty handling, bias, and patient-context dependence [167, 168]. The domain remains below robust L4 autonomy because medical validity is inseparable from safety, accountability, governance, and retained human responsibility. 5.7 Economics and Social Sciences Economics and social sciences remain in early-to-middle L2. Their workflows are highly compatible with AI-assisted literature search, data processing, coding, drafting, and exploratory synthesis, yet much harder to close autonomously"},{"citing_arxiv_id":"2605.20506","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reinforcing Human Behavior Simulation via Verbal Feedback","primary_cat":"cs.LG","submitted_at":"2026-05-19T21:23:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20176","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical 
Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-19T17:58:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ClinSeekAgent automates active multimodal evidence seeking for clinical reasoning, improving LLM performance on raw EHR and CXR tasks while enabling distillation into smaller models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17829","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Interactive Evaluation Requires a Design Science","primary_cat":"cs.AI","submitted_at":"2026-05-18T04:03:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16679","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?","primary_cat":"cs.CL","submitted_at":"2026-05-15T22:34:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CHI-Bench shows current AI agents achieve at most 28% success on long-horizon healthcare workflows that require dense policy adherence, multi-role handoffs, and multi-turn interactions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14892","ref_index":297,"ref_count":4,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Beyond Individual Intelligence: 
Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems","primary_cat":"cs.AI","submitted_at":"2026-05-14T14:36:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"and global coordination, a functional orchestration layer for managing tools and specialized agents, and an execution layer for carrying out concrete operations. Similarly, planner-executor architectures [296] separate global reasoning from action execution. TOA [297] structures agents in a hierarchy where parent nodes provide centralized aggregation and reasoning, while leaf nodes maintain autonomy for parallel context processing. GoA [298] expands decentralized execution by enabling non-linear, localized communication among agents, while maintaining a globally predefined topological constraint to ensure centralized goal alignment. AdaptOrch [299] further advances hybrid orchestration by introducing a centralized, task-adaptive orchestrator that dynamically selects the execution topology."},{"citing_arxiv_id":"2605.13542","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RealICU: Do LLM Agents Understand Long-Context ICU Data? 
A Benchmark Beyond Behavior Imitation","primary_cat":"cs.AI","submitted_at":"2026-05-13T13:52:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"RealICU is a new benchmark using physician hindsight labels on MIMIC-IV ICU data that exposes LLM failures in long-horizon clinical assessment, acute problem detection, action recommendation, and red-flag identification.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09679","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents","primary_cat":"cs.CV","submitted_at":"2026-05-10T17:57:57+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"evaluation uses a balanced subset of 10,000 questions from the test set, stratified across 42 subtypes. Benchmark Train Bench 2D-view 3D-volume Explicit reasoning Agent Compositional VQA Benchmarks VQA-RAD [34] 3.1K 451 ✓ PMC-VQA [60] 150K 2K ✓ OmniMedVQA [27] - 128K ✓ ✓ MedSG-Bench [58] 188K 9.6K ✓ ✓ 3D-RAD [52] 136K 3.7K ✓ ✓ Agent Benchmarks MedAgentBench [49] - 300 ✓ ✓ AgentClinic [48] - 200 ✓ ✓ MedChain [19] - 12K ✓ ✓ ChestAgentBench [18] - 2.5K ✓ ✓ DeepTumorVQA (Ours) 428K 10K ✓ ✓ ✓ ✓ ✓ Organ Volume Measurement What is the volume of the left kidney? A: 109.9 B: 156.6 C: 211.9 D: 173.6 Organ HU Value Measurement What is the mean HU value of the spleen? 
A: 131.1 B: 124.7 C: 117.5 D: 104.5 Lesion Volume Measurement"},{"citing_arxiv_id":"2605.06177","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents","primary_cat":"cs.AI","submitted_at":"2026-05-07T12:57:18+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BioMedArena releases a standardized toolkit with 147 biomedical benchmarks, 75 tools, and six harnesses that achieve SOTA results on eight tasks with a +15.03 percentage point average lift.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02240","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments","primary_cat":"cs.AI","submitted_at":"2026-05-04T05:32:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"PhysicianBench is a new benchmark of 100 physician-reviewed, execution-grounded tasks in live EHR environments where the best LLM agent reaches only 46% success and open-source models reach 19%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23802","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EndoGov: A knowledge-governed multi-agent expert system for endometrial cancer risk stratification","primary_cat":"cs.MA","submitted_at":"2026-04-26T16:54:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EndoGov uses specialist agents plus a governance layer with hard and soft rule paths to deliver guideline-compliant endometrial cancer risk stratification, reporting 0.943 accuracy and 
0.93% logic-violation rate on TCGA-UCEC while outperforming neural baselines on CPTAC-UCEC.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"uncertainty through probabilistic graphical structure rather than hard-priority overrides [28]. LLM-powered agentic systems have opened new avenues for clinical AI. Foundation models encode broad clinical knowledge [39] and serve as biomedical reasoning engines [44], while pathology-specific models have scaled to diverse clinical-grade tasks [27, 42, 58, 54]. Multi-agent frameworks have begun to formalize clinical workflows: AgentClinic [37] evaluates AI agents in simulated clinical environments; MedAgent-Pro [55] introduces an evidence-based agentic workflow for multimodal diagnosis; CLARITY [38] provides triage and routing with explicit reasoning chains; Kg4Diagnosis [62] enhances multi-agent LLMs with knowledge-graph retrieval; and SmartPath [57] augments pathology co-pilots with reasoning capabilities."},{"citing_arxiv_id":"2604.14475","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Evo-MedAgent: Beyond One-Shot Diagnosis with Agents That Remember, Reflect, and Improve","primary_cat":"cs.AI","submitted_at":"2026-04-15T23:12:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Evo-MedAgent adds three evolving memory stores to LLM agents for chest X-ray diagnosis, raising MCQ accuracy from 0.68 to 0.79 on GPT-5-mini and 0.76 to 0.87 on Gemini-3 Flash without any training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12076","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language 
Models","primary_cat":"cs.CL","submitted_at":"2026-04-13T21:29:46+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tested models through their BiasBuster framework. Macmillan-Scott and Musolesi [18] assessed a broader set of cognitive biases in LLMs, including the decoy effect and availability heuristic, reporting that models exhibited \"(ir)rationality\" patterns strikingly similar to those documented in the human heuristics-and-biases literature. In the clinical domain, Schmidgall et al. [13] demonstrated that medical LLMs exhibit anchoring bias and framing effects that mirror known patterns of diagnostic error in human physicians. 
Separately, research on LLM sycophancy [19] has shown that models display a tendency to agree with or affirm user positions, a behavior that may interact with bias expression: a sycophantic model might amplify an"},{"citing_arxiv_id":"2604.06846","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors","primary_cat":"cs.CL","submitted_at":"2026-04-08T09:09:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MedDialBench shows LLMs suffer 1.7-3.4x larger diagnostic accuracy drops from patients fabricating symptoms than withholding them, with fabrication driving super-additive interaction effects across models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.20490","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RadAgents: Multimodal Agentic Reasoning for Chest X-ray Interpretation with Radiologist-like Workflows","primary_cat":"cs.MA","submitted_at":"2025-09-24T19:08:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RadAgents is a multi-agent framework coupling clinical priors with task-aware multimodal reasoning and radiologist-like workflows, plus grounding and retrieval-augmentation for conflict resolution in chest X-ray interpretation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.07407","ref_index":84,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic 
Systems","primary_cat":"cs.AI","submitted_at":"2025-08-10T16:07:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.04325","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models","primary_cat":"cs.CL","submitted_at":"2025-08-06T11:11:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MedCheck is a lifecycle checklist framework that audits 53 existing medical LLM benchmarks and identifies systemic gaps in clinical fidelity, contamination control, and safety metrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.15867","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RDMA: Cost Effective Agent-Driven Rare Disease Mining from Electronic Health Records","primary_cat":"cs.LG","submitted_at":"2025-07-14T23:31:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RDMA equips small LLMs with abbreviation resolution, phenotype reasoning, and ontology tools to mine rare diseases from EHR notes, outperforming fine-tuned and RAG baselines at up to 10x lower inference cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.19678","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From LLM Reasoning to Autonomous 
AI Agents: A Comprehensive Review","primary_cat":"cs.AI","submitted_at":"2025-04-28T11:08:22+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"hurdles, including concerns over reliability, reproducibility, ethical governance, and safety [29], [30], [31]. Addressing these issues is crucial for ensuring that LLM-based agents can be effectively and responsibly incorporated into clinical practice, underscoring the need for comprehensive evaluation frameworks that can reliably measure their performance across various healthcare tasks [32], [33], [34], [35]. LLM-based agents are emerging as a promising frontier in AI, combining reasoning and action to interact with complex digital environments [36], [37]. 
Therefore, various approaches have been explored to enhance LLM-based agents, from combining reasoning and acting using techniques like React [38] and Monte Carlo Tree Search [39] to synthesizing high-"},{"citing_arxiv_id":"2503.12605","ref_index":259,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey","primary_cat":"cs.CV","submitted_at":"2025-03-16T18:39:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Math-Vision [166] 2024 ScienceQA Math T, I MC and Open 3040 OSWorld [257] 2024 Agent Real Comp. Env. T,I Agent Actions 369 AgentClinic [258] 2024 MedicalQA Medical T,I Open 335 MeViS [170] 2023 Referring VOS Common T, V Dense Mask 2K VSIBench [169] 2024 VideoQA Indoor T, V MC and Open 5K HallusionBench [171] 2024 VQA Common T, I Yes-No 1,129 AV-Odyssey [259] 2024 AVQA Common T, V, A MC 4,555 AVHBench [173] 2024 AVQA Common T, V, A Open 5,816 RefAVS-Bench [168] 2024 Referring AVS Common T, V, A Dense Mask 4,770 MMAU [260] 2024 AQA Common T, A MC 10K AVTrustBench [172] 2025 AVQA Common T, V, A MC and Open 600K MIG-Bench [135] 2025 Multi-image Grounding Common T, I BBox 5.89K MedAgentsBench [261] 2025 MedicalQA Medical T, I MC and Open 862"},{"citing_arxiv_id":"2501.04227","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Agent Laboratory: Using LLM Agents as Research 
Assistants","primary_cat":"cs.HC","submitted_at":"2025-01-08T01:58:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Agent Laboratory is an autonomous LLM framework that completes end-to-end research from idea to report and code, with human feedback improving quality and cutting expenses by 84% while reaching competitive ML performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}