{"total":16,"items":[{"citing_arxiv_id":"2605.20277","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Regulating Anatomy-Aware Rewards via Trajectory-Integral Feedback for Volumetric Computed Tomography Analysis","primary_cat":"cs.CV","submitted_at":"2026-05-19T04:33:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TIF-GRPO uses integral feedback on pseudo-temporal trajectories to regulate anatomy-aware rewards in RL for clinical faithfulness in volumetric CT analysis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15561","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RoiMAM: Region-of-Interest Medical Attention Model for Efficient Vision-Language Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-15T03:07:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RoiMAM integrates a training-free ROI Generation Module with Semantic Selective Suppression and a Text Prompt Enhancer to produce a compact VLM that reports 2 percent and 4.6 percent accuracy gains on SLAKE and PMC-VQA at less than 20 percent the size of MedVInT-TD.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14403","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making","primary_cat":"cs.CV","submitted_at":"2026-05-14T05:41:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DermAgent orchestrates seven vision-language tools in a Plan-Execute-Reflect loop with dual-modality retrieval 
from 413k cases and a critic module to outperform GPT-4o by 17.6% in zero-shot dermatological diagnosis accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10286","ref_index":42,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks","primary_cat":"cs.AI","submitted_at":"2026-05-11T09:46:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09679","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents","primary_cat":"cs.CV","submitted_at":"2026-05-10T17:57:57+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"primarily differ in how they encode volumetric input: RadFM [55] tokenizes 3D patches through a perceiver-based architecture trained on 16M image-text pairs; M3D-LaMed [7] applies a 3D vision transformer over 120K CT-text pairs; Merlin [12] adopts a ResNet-based 3D encoder; CT-CHAT [23] extends this line to full 3D CT report generation. 2D VLMs are adapted via finetuning (LLaVA-Med [37], HuatuoGPT-Vision [13], MedGemma [56]) or 
applied zero-shot (Qwen3-VL [8], Qwen3.5, InternVL3 [15], LLaVA-OneVision [36]). Frontier model APIs (Gemini-3, GPT-5.4) offer strong reasoning but lack systematic 3D CT evaluation. Despite rapid progress, no unified evaluation compares all three categories on identical 3D diagnostic tasks; DeepTumorVQA fills this gap. Medical AI Agents."},{"citing_arxiv_id":"2605.06537","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MedHorizon: Towards Long-context Medical Video Understanding in the Wild","primary_cat":"cs.CV","submitted_at":"2026-05-07T16:37:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20350","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis","primary_cat":"cs.CV","submitted_at":"2026-04-22T08:52:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"X-PCR is a new benchmark of 26,415 images and 177,868 expert VQA pairs that evaluates MLLMs on six-stage progressive reasoning and cross-modality integration in ophthalmology.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14316","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Seeing Through Experts Eyes A Foundational Vision Language Model Trained on Radiologists Gaze and 
Reasoning","primary_cat":"cs.AI","submitted_at":"2026-04-15T18:19:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GazeX uses radiologist gaze trajectories as a behavioral prior during pretraining to generate more accurate and expert-consistent results in chest X-ray report generation, disease grounding, and visual question answering.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13756","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging","primary_cat":"cs.CL","submitted_at":"2026-04-15T11:41:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MedRCube is a new fine-grained evaluation framework that benchmarks 33 MLLMs on medical imaging, ranks Lingshu-32B highest, and finds a significant positive link between shortcut behaviors and diagnostic performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07128","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Utility-preserving De-identification Pipeline for Cross-hospital Radiology Data Sharing","primary_cat":"cs.CV","submitted_at":"2026-04-08T14:21:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The UPDP pipeline filters privacy terms and generates de-identified radiology images that preserve diagnostic pathology information, enabling models with competitive disease detection accuracy but reduced identity leakage and improved cross-hospital 
performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.24649","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MedOpenClaw and MedFlowBench: Auditing Medical Agents in Full-Study Workflows","primary_cat":"cs.CV","submitted_at":"2026-03-25T17:33:58+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"MedFlowBench evaluates VLM agents on full radiology and pathology studies by requiring both task answers and verifiable evidence like key slices and regions of interest, revealing that answer-only scores overestimate performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.21950","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-02-25T14:33:33+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MEDSYN benchmark shows MLLMs match experts on differential diagnosis lists but have much larger gaps to final diagnosis selection than humans, due to text overreliance and cross-modal evidence gaps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.03054","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation","primary_cat":"cs.CV","submitted_at":"2026-01-06T14:37:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"IBISAgent enables MLLMs to perform iterative pixel-level 
visual reasoning for biomedical object referring and segmentation via text-based clicks and agentic RL, outperforming prior SOTA methods without model modifications.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.11989","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Thought Graph Traversal for Test-time Scaling in Chest X-ray VLLMs","primary_cat":"cs.CV","submitted_at":"2025-06-13T17:46:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A new prompting framework called Thought Graph Traversal combined with reasoning budget forcing improves test-time performance of frozen chest X-ray VLLMs on report generation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.05831","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HeartcareGPT: A Unified Multimodal ECG Suite for Dual Signal-Image Modeling and Understanding","primary_cat":"cs.LG","submitted_at":"2025-06-06T07:56:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HeartcareGPT proposes Dual Stream Projection Alignment (DSPA) on a structure-aware tokenizer for unified ECG signal-image modeling, supported by Heartcare-400K dataset and Heartcare-Bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.18925","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs","primary_cat":"cs.CL","submitted_at":"2024-12-25T15:12:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HuatuoGPT-o1 achieves superior medical 
complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}