LiveMedBench: A contamination-free medical benchmark for LLMs with automated rubric evaluation. arXiv preprint arXiv:2602.10367, 2026
4 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory
  SkeMex distills agent trajectories into value-aware skills organized in general/task/action branches and evolves them via a closed-loop Read-Write-Assess-Govern process, outperforming prior memory agents on clinical tasks.
- MedCTA: A Benchmark for Clinical Tool Agents
  MedCTA is a new benchmark with 107 real-world tasks and process-aware metrics that shows frontier multimodal models remain brittle at autonomous tool selection, execution, and trajectory completion in clinical settings.
- When Medical Safety Alignment Fails: A Benchmark for Evaluating LLMs on High-Risk Medical Queries
  MedHarm benchmark shows aligned LLMs and guardrails can still produce unsafe responses on high-risk medical queries, indicating medical safety requires domain-specific testing.
- LLM-as-a-Judge in Healthcare: A Scoping Analysis of Applications, Methods, and Human Alignment
  Scoping review of 134 studies on LLM-as-a-Judge in healthcare finds concentration in clinical decision support and NLP, frequent use of OpenAI models with prompt engineering, and moderate-to-strong human alignment where validated.