arxiv: 2606.09365 · v2 · pith:UATCQJCVnew · submitted 2026-06-08 · 💻 cs.AI · cs.CL

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

Pith reviewed 2026-06-27 16:43 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords medical agents · skill memory · self-evolving framework · clinical decision making · memory governance · utility estimation · interaction trajectories · procedural knowledge

The pith

Medical agents improve clinical reasoning by building and governing reusable skill memories from interaction feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that medical agents can accumulate compact and reliable experience for long-horizon clinical reasoning by turning interaction trajectories into structured skills rather than storing raw traces. It organizes these skills into a repository with general, task-specific, and action-level branches. Utility is estimated from environment feedback to decide what to retrieve and what to keep, and a closed-loop lifecycle manages how the repository evolves. This targets the noisy, ungoverned memory of existing systems. If successful, agents could generalize better across cases without updating their underlying model weights.

Core claim

SkeMex is a post-deployment self-evolution framework that distills informative interaction trajectories into structured skills encoding reusable procedural knowledge and organizes them into a multi-branch repository. It estimates context-dependent utility from environment feedback to guide value-aware retrieval and repository governance, and it uses a closed-loop Read-Write-Assess-Govern lifecycle to write new skills, update utilities, promote useful memories, and remove harmful entries. The paper reports that this consistently outperforms representative memory-based agents on diverse clinical tasks in both offline and online settings, generalizes across model backbones, and supports transferable skill memory.

What carries the argument

The multi-branch skill memory repository combined with context-dependent utility estimation from environment feedback to drive the Read-Write-Assess-Govern lifecycle.
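
To make the mechanism concrete, here is a minimal sketch of a multi-branch repository with value-aware retrieval. The summary above does not specify the data layout or the scoring rule, so the `Skill` fields, the word-overlap relevance measure, the `alpha` blend weight, and the example skills are all illustrative assumptions rather than SkeMex's actual implementation.

```python
from dataclasses import dataclass, field

# Branch names follow the summary above; everything else is hypothetical.
BRANCHES = ("general", "task", "action")


@dataclass
class Skill:
    skill_id: str
    branch: str
    text: str
    utility: float = 0.5  # assumed prior utility in [0, 1]


@dataclass
class SkillRepository:
    skills: dict = field(default_factory=dict)

    def write(self, skill: Skill) -> None:
        """Store a skill under one of the known branches."""
        if skill.branch not in BRANCHES:
            raise ValueError(f"unknown branch: {skill.branch}")
        self.skills[skill.skill_id] = skill

    def read(self, query: str, k: int = 2, alpha: float = 0.5) -> list:
        """Return the top-k skills by a blend of relevance and stored utility."""
        q = set(query.lower().split())

        def score(s: Skill) -> float:
            words = set(s.text.lower().split())
            relevance = len(q & words) / max(len(q | words), 1)  # Jaccard overlap
            return alpha * relevance + (1 - alpha) * s.utility

        return sorted(self.skills.values(), key=score, reverse=True)[:k]


repo = SkillRepository()
repo.write(Skill("g1", "general", "verify diagnostic thresholds before answering", 0.9))
repo.write(Skill("t1", "task", "billing code query use web search", 0.6))
repo.write(Skill("a1", "action", "billing code query use literature search", 0.1))
top = repo.read("billing code query", k=2)
print([s.skill_id for s in top])
```

In this toy setup the two billing skills are equally relevant to the query, so stored utility decides between them. That is the sense in which retrieval is value-aware here.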

If this is right

  • Agents achieve better performance than other memory-based systems on clinical tasks in both offline and online settings.
  • The skill memory supports generalization across different model backbones.
  • Skill memories can be transferred between different tasks or settings.
  • The governance process keeps the memory compact by removing harmful or low-utility entries.
  • Procedural knowledge is reused for long-horizon reasoning without redundant raw traces.
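
The compaction claim in the list above can be pictured as a periodic governance pass over stored utilities. The sketch below is only an illustration: the thresholds `promote_at` and `remove_at`, the minimum-evidence rule `min_uses`, and the example values are assumptions, not rules taken from the paper.

```python
def govern(utilities, uses, promote_at=0.8, remove_at=0.2, min_uses=3):
    """Split skill ids into promoted, kept, and removed lists.

    All thresholds are hypothetical. Skills with too little evidence are
    kept regardless of their current utility estimate.
    """
    promoted, kept, removed = [], [], []
    for skill_id, utility in utilities.items():
        if uses.get(skill_id, 0) < min_uses:
            kept.append(skill_id)      # not enough feedback to judge yet
        elif utility >= promote_at:
            promoted.append(skill_id)
        elif utility <= remove_at:
            removed.append(skill_id)   # consistently unhelpful or harmful
        else:
            kept.append(skill_id)
    return promoted, kept, removed


# Toy values for illustration only.
utilities = {"s1": 0.91, "s2": 0.12, "s3": 0.05, "s4": 0.55}
uses = {"s1": 10, "s2": 7, "s3": 1, "s4": 4}
promoted, kept, removed = govern(utilities, uses)
print(promoted, kept, removed)
```

Skill `s3` has the lowest utility but survives because it has been used only once. A minimum-evidence rule of this kind is one simple guard against pruning on noise.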

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar self-evolving memory could be applied to agent systems in other domains like legal reasoning or scientific discovery where experience accumulates over interactions.
  • By focusing on skills instead of raw data, this might reduce storage and retrieval costs in long-term agent deployments.
  • Accurate utility estimation could help in creating more reliable agents for real-world use by automatically discarding skills that lead to poor outcomes.
  • Future work could test if the same framework works when the agent interacts with real human users rather than simulated environments.

Load-bearing premise

Environment feedback provides a reliable and unbiased signal for judging the usefulness of skills in different contexts.

What would settle it

An experiment where environment feedback is corrupted or biased, then measuring whether the skill governance still leads to performance gains or instead promotes bad skills.
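
A toy version of that stress test can be simulated in a few lines. The sketch assumes binary feedback, a "good" skill with a true success rate of 0.8 and a "bad" one at 0.3, and symmetric label flipping; none of these numbers come from the paper. Under those assumptions the expected utility gap between the two skills shrinks by a factor of (1 - 2 × flip probability), so at a flip probability of 0.5 the feedback can no longer separate them.

```python
import random


def observed_utility(p_success, flip_prob, n, rng):
    """Mean binary feedback for a skill with true success rate p_success
    when each feedback signal is flipped with probability flip_prob."""
    total = 0
    for _ in range(n):
        outcome = rng.random() < p_success
        if rng.random() < flip_prob:
            outcome = not outcome  # corrupted feedback
        total += outcome
    return total / n


rng = random.Random(0)
gaps = {}
for flip_prob in (0.0, 0.2, 0.5):
    good = observed_utility(0.8, flip_prob, 5000, rng)  # hypothetical good skill
    bad = observed_utility(0.3, flip_prob, 5000, rng)   # hypothetical bad skill
    gaps[flip_prob] = good - bad
    print(f"flip_prob={flip_prob:.1f}  estimated utility gap={gaps[flip_prob]:+.3f}")
```

Biased corruption (for example, feedback that over-reports success for one skill family) would be a harsher test than symmetric flipping, because it can reverse the ranking rather than merely blur it.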

Figures

Figures reproduced from arXiv: 2606.09365 by Fanrui Zhang, Haoran Sun, Kaitao Chen, Lei Liu, Mianxin Liu, Wenjie Li, Xingqi He, Yankai Jiang, Yichen Li, Yujie Zhang, Zekai Lin.

Figure 1: Comparison of (a) conventional memory, (b) training-based methods, and (c) our method. To address this gap, recent work explores memory-augmented medical agents that store and reuse past interactions [25, 29, 47, 71]. Early methods mainly focus on intra-task memory, which records observations, reasoning steps, and tool interactions within a single clinical case to maintain long-horizon coherence [18, 79].…
Figure 2: Overview of SkeMex. Components A–F constitute a closed-loop self-evolving cycle
Figure 3: Main results on out-of-domain benchmarks (offline). Background colors denote different…
Figure 4: Ablation study on valuation module. Cell values show gaps from SkeMex. In this paper, we introduced SkeMex, a post-deployment self-evolution framework that enables medical agents to improve through skill-based memory without updating model weights. By combining reusable skill distillation, utility-driven valuation, and closed-loop memory governance, SkeMex supports reliable experience accumulation and r…
Figure 5: Sensitivity analysis with respect to the number of retrieved skills
Figure 6: Sensitivity analysis with respect to retrieval channel weights
Figure 7: Sensitivity analysis with respect to the learning window size
Figure 8: Sensitivity analysis with respect to the utility smoothing coefficient
Figure 9: Sensitivity analysis with respect to the maximum utility update step
Figure 10: Cross backbone generalization results on Qwen3.6-Max-Preview.
Figure 11: Cross backbone generalization results on Kimi-2.6.
Figure 12: Cross backbone generalization results on GLM-5.1.
Figure 13: Case 1: Skill-guided avoidance of uninformative search loops (AgentClinic). The agent is tasked with diagnosing a patient presenting with episodic unresponsiveness and facial grimacing. Skill a000010 (Diagnosis-Specific Querying with MedRAG Search) activates a low-information gate: when the clinical vignette lacks discriminating features, the agent is instructed to forgo open-ended literature retrieval an…
Figure 14: Case 2: Skill-guided tool selection for administrative queries (HealthBench). The task involves an ICD-10 billing code question (a non-clinical, administrative query). Two skills collaborate: Skill t000032 identifies the query as administrative and suppresses the clinical literature retrieval tool (medrag_search), while Skill a000001 redirects the agent to use tavily_search for authoritative coding referenc…
Figure 15: Case 3: Skill-guided boundary verification via reflection (LiveMedBench). The agent must determine whether a liver biopsy report meets the diagnostic threshold for cirrhosis. Skill t000041 (Verify Boundary Diagnostic Thresholds After Reflection) instructs the agent to retrieve the explicit staging criteria (Brunt system) and then invoke the reflection tool to systematically verify each criterion against t…
Figure 16: Case 4: Multimodal skill-guided procedure selection (LiveClinBench-MM). Given a colonoscopy image of an infiltrative sigmoid-colon mass alongside a 10-option MCQ, the agent must select the safest endoscopic biopsy approach. Skill a000020 (MedRAG Search for Comparative Procedure Indications) triggers iterative guideline retrieval to compare biopsy techniques, while Skill t000057 subsequently recognizes the…
Figure 17: Case 5 (Failure): Insufficient diagnostic specificity due to premature convergence (AgentClinic). The agent is asked to provide the single most likely diagnosis for a patient with classic B symptoms and supraclavicular lymphadenopathy; the ground truth is Diffuse Large B-Cell Lymphoma (DLBCL). Although Skill a000010 correctly initiates a vignette-first differential workflow, two early tool-format errors (…
Figure 18: The default system prompt that defines the strict format constraints (planning, reasoning,…
Figure 19: The user prompt template that concatenates conversation history, current request, injected…
Figure 20: The forced convergence block injected at the maximum step limit to compel the agent to…
Figure 21: The header prompt used to format and inject retrieved experience skills (grouped by…
Figure 22: The task classifier prompt used to assign the current medical query to a high-level, action…
Figure 23: The pre-screen prompt used during retrieval to select the most semantically relevant…
Figure 24: The trajectory analysis prompt for binary-outcome datasets, used to evaluate skill adoption…
Figure 25: The trajectory analysis prompt for rubric-based datasets, designed to attribute specific…
Figure 26: The mutation prompt that converts extracted patterns into concrete skill drafts (CREATE)…
Figure 27: The draft review prompt acting as a governance gatekeeper to evaluate the novelty, quality,…
Figure 28: The merge prompt used to consolidate two semantically similar or overlapping skills into…
Figure 29: The system prompt for the HealthBench automatic grader, instructing the LLM to…
Figure 30: The system prompt for the LiveMedBench automatic grader, tailored for evaluating…
Figure 31: The user prompt template for both HealthBench and LiveMedBench graders, supplying…
Original abstract

Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing memory mechanisms often retain raw historical traces that are redundant, noisy, and difficult to govern. More importantly, they rarely distinguish which memories are truly useful for future reasoning. This limits their ability to accumulate compact and reliable experience for long-horizon clinical reasoning. To close this gap, we propose SkeMex, a post-deployment self-evolution framework that improves medical agents through a skill-based memory without updating model weights. SkeMex distills informative interaction trajectories into structured skills that encode reusable procedural knowledge, and organizes them into a multi-branch repository spanning general, task-specific, and action-level experience. To determine which memories should be reused and retained, SkeMex estimates context-dependent utility from environment feedback and uses it to guide value-aware retrieval and repository governance. A closed-loop "Read-Write-Assess-Govern" lifecycle further supports continual evolution by writing new skills, updating utilities, promoting useful memories, and removing harmful entries. Experiments across diverse clinical tasks show that SkeMex consistently outperforms representative memory-based agents in both offline and online settings. It also generalizes across model backbones and supports transferable skill memory. All data and code will be released publicly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SkeMex, a post-deployment self-evolution framework for medical agents. It distills interaction trajectories into structured skills organized in a multi-branch repository (general, task-specific, action-level), estimates context-dependent utility from environment feedback to guide retrieval and governance, and implements a closed-loop Read-Write-Assess-Govern lifecycle for continual evolution without model weight updates. Experiments across diverse clinical tasks report consistent outperformance over representative memory-based agents in offline and online settings, with generalization across model backbones and transferable skill memory.

Significance. If the results hold, the work could meaningfully advance adaptive medical AI systems by enabling compact, governable experience accumulation for long-horizon clinical reasoning. The emphasis on skill distillation over raw traces and the public release of data/code strengthen reproducibility and practical impact.

major comments (2)
  1. [§3 (Method), Assess/Govern stages] The central claim that context-dependent utility estimated from environment feedback reliably drives promotion, retention, and removal decisions lacks a concrete description of the utility estimator, any debiasing procedures, or handling of noisy/delayed clinical feedback. This is load-bearing for the governance loop and the headline outperformance result; without it, observed gains could be artifacts of the simulation environments rather than robust skill evolution.
  2. [§5 (Experiments)] The reported consistent outperformance across tasks and backbones requires ablations that isolate the utility-based governance from skill distillation alone, plus statistical significance testing and failure-case analysis (e.g., retention of low-value skills). These are needed to substantiate that the closed-loop mechanism, rather than other factors, drives the gains.
minor comments (2)
  1. [Figure 2, repository diagram] The multi-branch structure and Read-Write-Assess-Govern flow would be clearer with explicit pseudocode or an expanded legend for the utility update rule.
  2. [§4 (Implementation)] The description of how skills are distilled from trajectories could include more detail on the prompting templates or extraction heuristics used.
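
As an illustration of what an explicit utility update rule could look like, the sketch below uses exponential smoothing with a clipped step. The figure captions mention a "utility smoothing coefficient" and a "maximum utility update step", which motivates the two parameters, but the formula itself, the parameter values, and the feedback sequence are assumptions and should not be read as the paper's rule.

```python
def update_utility(utility, feedback, smoothing=0.3, max_step=0.2):
    """Move utility toward the feedback signal by an exponential-smoothing
    step, clipped to at most max_step in either direction (assumed rule)."""
    step = smoothing * (feedback - utility)
    step = max(-max_step, min(max_step, step))  # bound the per-update change
    return utility + step


u = 0.5  # hypothetical starting utility
history = []
for feedback in (1.0, 1.0, 0.0):  # toy feedback: two successes, one failure
    u = update_utility(u, feedback)
    history.append(round(u, 3))
print(history)
```

The clip only activates on the third update here, where a single failure would otherwise pull the estimate down by more than `max_step`. Bounding the step is one way to keep a single noisy episode from dominating the estimate.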

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments identify areas where additional clarity and analysis will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Point-by-point responses
  1. Referee: [§3 (Method), Assess/Govern stages] The central claim that context-dependent utility estimated from environment feedback reliably drives promotion, retention, and removal decisions lacks a concrete description of the utility estimator, any debiasing procedures, or handling of noisy/delayed clinical feedback. This is load-bearing for the governance loop and the headline outperformance result; without it, observed gains could be artifacts of the simulation environments rather than robust skill evolution.

    Authors: We agree that the current description of the utility estimator in §3 is high-level and requires expansion to fully substantiate the governance mechanism. In the revised manuscript we will add an explicit formulation of the context-dependent utility computation (including the precise aggregation of environment feedback signals such as task success, step efficiency, and error recovery), along with a dedicated subsection on potential biases and mitigation strategies. We will also discuss the deterministic nature of feedback in the current simulation environments and outline planned extensions for handling noisy or delayed real-world clinical signals. These additions will clarify how promotion, retention, and removal decisions are driven and reduce the risk that gains are simulation-specific artifacts. revision: yes

  2. Referee: [§5 (Experiments)] The reported consistent outperformance across tasks and backbones requires ablations that isolate the utility-based governance from skill distillation alone, plus statistical significance testing and failure-case analysis (e.g., retention of low-value skills). These are needed to substantiate that the closed-loop mechanism, rather than other factors, drives the gains.

    Authors: We concur that isolating the contribution of the Assess/Govern stages is essential. The revision will include new ablation experiments that disable the utility-based governance while retaining skill distillation and the Read-Write components, allowing direct comparison of performance deltas. We will also report statistical significance (paired t-tests or Wilcoxon tests across repeated runs with different seeds) and add a failure-case analysis subsection that examines instances of low-value skill retention or erroneous removal. These results will be presented alongside the existing offline and online evaluations to demonstrate that the closed-loop lifecycle, rather than distillation alone, accounts for the observed improvements. revision: yes
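
For a sense of what a paired comparison across seeds involves, here is a small self-contained sketch. It uses an exact sign-flip permutation test on paired differences rather than the paired t-test or Wilcoxon test named in the response, chosen only because it needs no external library. The scores are placeholder numbers invented for illustration and are not results from the paper.

```python
from itertools import product


def paired_sign_flip_pvalue(a, b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Under the null, each paired difference is equally likely to have either sign.
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        hits += stat >= observed - 1e-12  # small tolerance for float rounding
        total += 1
    return hits / total


# Placeholder per-seed scores for illustration only; NOT from the paper.
with_governance = [0.71, 0.69, 0.74, 0.70, 0.72, 0.73]
without_governance = [0.66, 0.67, 0.70, 0.68, 0.69, 0.70]
p_value = paired_sign_flip_pvalue(with_governance, without_governance)
print(p_value)
```

With six seeds and every difference pointing the same way, the smallest achievable two-sided p-value is 2/64, which shows how few repeated runs limit the strength of any significance claim.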

Circularity Check

0 steps flagged

No significant circularity; framework is self-contained via external experiments

full rationale

The paper introduces SkeMex as a post-deployment framework with a Read-Write-Assess-Govern lifecycle that distills trajectories into skills and governs them via environment feedback for utility. No equations, parameter fits, derivations, or self-citations appear in the abstract or description. Outperformance claims rest on experiments across clinical tasks, model backbones, and offline/online settings rather than any reduction of outputs to inputs by construction. The central claims are empirically grounded and independent of self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities beyond the high-level description of the framework itself; the skill repository and utility estimator are methodological constructs rather than new physical or mathematical entities.

axioms (1)
  • Domain assumption: Interaction trajectories contain extractable procedural knowledge that can be represented as reusable skills without critical loss of context.
    Implicit in the distillation step; required for the memory to be compact yet effective.



Reference graph

Works this paper leans on

132 extracted references · 25 linked inside Pith

  1. [1] Anthropic. Building agents with skills: Equipping agents for specialized work, 2026

  2. [2] Anthropic. Claude Sonnet 4.6. https://www.anthropic.com/claude/sonnet, 2026

  3. [3] Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. Healthbench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775, 2025

  4. [4] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  5. [5] Howard S Barrows and Paul J Feltovich. The clinical reasoning process. Medical education, 21(2):86–91, 1987

  6. [6] Olivier Bodenreider. The unified medical language system (umls): integrating biomedical terminology. Nucleic acids research, 32(suppl_1):D267–D270, 2004

  7. [7] Hanjie Chen, Zhouxiang Fang, Yash Singla, and Mark Dredze. Benchmarking large language models on answering and explaining challenging medical questions. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3563–3599, 2025

  8. [8] Chengfeng Dou, Chong Liu, Fan Yang, Fei Li, Jiyuan Jia, Mingyang Chen, Qiang Ju, Shuai Wang, Shunya Dang, Tianpeng Li, et al. Baichuan-m2: Scaling medical capability with large verifier system. arXiv preprint arXiv:2509.02208, 2025

  9. [9] Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr Kuleshov, Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado, Sebastian Thrun, and Jeff Dean. A guide to deep learning in healthcare. Nature medicine, 25(1):24–29, 2019

  10. [10] Lin Fan, Pengyu Dai, Zhipeng Deng, Haolin Wang, Xun Gong, Yefeng Zheng, and Yafei Ou. Evolving medical imaging agents via experience-driven self-skill discovery. arXiv preprint arXiv:2603.05860, 2026

  11. [11] Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. Memp: Exploring agent procedural memory. arXiv preprint arXiv:2508.06433, 2025

  12. [12] Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, et al. A survey of self-evolving agents: On path to artificial super intelligence. arXiv preprint arXiv:2507.21046, 1, 2025

  13. [13] Google. A new era of intelligence with gemini 3, 2025

  14. [14] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. Critic: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738, 2023

  15. [15] Siyuan Guo, Cheng Deng, Ying Wen, Hechang Chen, Yi Chang, and Jun Wang. Ds-agent: Automated data science by empowering large language models with case-based reasoning. arXiv preprint arXiv:2402.17453, 2024

  16. [16] Siyuan Guo, Huiwu Liu, Xiaolong Chen, Yuming Xie, Liang Zhang, Tao Han, Hechang Chen, Yi Chang, and Jun Wang. Optimizing case-based reasoning system for functional test script generation with large language models. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pages 4487–4498, 2025

  17. [17] Xiao Han, Yuzheng Fan, Sendong Zhao, Haochun Wang, and Bing Qin. Gsem: Graph-based self-evolving memory for experience augmented clinical reasoning. arXiv preprint arXiv:2603.22096, 2026

  18. [18] Xiaobin Hu, Yunhang Qian, Jiaquan Yu, Jingjing Liu, Xiaozhong Ji, Chengming Xu, Peng Tang, Chengming Xu, Peng Tang, Jiawei Liu, et al. The landscape of medical agents: A survey. 2026

  19. [19] Songtao Jiang, Yuan Wang, Sibo Song, Tianxiang Hu, Chenyi Zhou, Bin Pu, Yan Zhang, Zhibo Yang, Yang Feng, Joey Tianyi Zhou, et al. Hulu-med: A transparent generalist model towards holistic medical vision-language understanding. arXiv preprint arXiv:2510.08668, 2025

  20. [20] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021

  21. [21] Ruofan Jin, Zaixi Zhang, Mengdi Wang, and Le Cong. Stella: Self-evolving llm agent for biomedical research. arXiv preprint arXiv:2507.02004, 2025

  22. [22] Yubin Kim, Chanwoo Park, Hyewon Jeong, Yik S Chan, Xuhai Xu, Daniel McDuff, Hyeonhoon Lee, Marzyeh Ghassemi, Cynthia Breazeal, and Hae W Park. Mdagents: An adaptive collaboration of llms for medical decision-making. Advances in Neural Information Processing Systems, 37:79410–79452, 2024

  23. [23] Janet L Kolodner. An introduction to case-based reasoning. Artificial intelligence review, 6(1):3–34, 1992

  24. [24] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  25. [25] Yunghwei Lai, Weizhi Ma, and Yang Liu. Patient-zero: A unified framework for real-record-free patient agent generation. arXiv e-prints, pages arXiv–2509, 2025

  26. [26] Kunyao Lan, Bingrui Jin, Zichen Zhu, Siyuan Chen, Shu Zhang, Kenny Q Zhu, and Mengyue Wu. Depression diagnosis dialogue simulation: self-improving psychiatrist with tertiary memory. arXiv preprint arXiv:2409.15084, 2024

  27. [27] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020

  28. [28] Binxu Li, Tiankai Yan, Yuanting Pan, Jie Luo, Ruiyang Ji, Jiayuan Ding, Zhe Xu, Shilong Liu, Haoyu Dong, Zihao Lin, et al. Mmedagent: Learning to use medical tools with multi-modal agent. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8745–8760, 2024

  29. [29] Junkai Li, Yunghwei Lai, Weitao Li, Jingyi Ren, Meng Zhang, Xinhui Kang, Siyu Wang, Peng Li, Ya-Qin Zhang, Weizhi Ma, et al. Agent hospital: A simulacrum of hospital with evolvable medical agents. arXiv preprint arXiv:2405.02957, 2024

  30. [30] Shuyue S Li, Vidhisha Balachandran, Shangbin Feng, Jonathan S Ilgen, Emma Pierson, Pang W Koh, and Yulia Tsvetkov. Mediq: Question-asking llms and a benchmark for reliable interactive clinical reasoning. Advances in Neural Information Processing Systems, 37:28858–28888, 2024

  31. [31] Zhiyu Li, Shichao Song, Hanyu Wang, Simin Niu, Ding Chen, Jiawei Yang, Chenyang Xi, Huayi Lai, Jihao Zhao, Yezhaohui Wang, et al. Memos: An operating system for memory-augmented generation (mag) in large language models. arXiv preprint arXiv:2505.22101, 2025

  32. [32] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025

  33. [33] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the association for computational linguistics, 12:157–173, 2024

  34. [34] Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver. arXiv preprint arXiv:2604.08377, 2026

  35. [35] Moonshot AI. Kimi K2.6: Advancing Open-Source Coding. https://www.kimi.com/blog/kimi-k2-6, 2026

  36. [36] Jaap MJ Murre and Joeri Dros. Replication and analysis of ebbinghaus’ forgetting curve. PloS one, 10(7):e0120644, 2015

  37. [37]

    Trace2skill: Distill trajectory-local lessons into transferable agent skills. arXiv preprint arXiv:2603.25158, 2026

    Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Xiaoxi Jiang, and Guanjun Jiang. Trace2skill: Distill trajectory-local lessons into transferable agent skills. arXiv preprint arXiv:2603.25158, 2026

  38. [38]

    New embedding models and api updates, 2024

    OpenAI. New embedding models and api updates, 2024

  39. [39]

    Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  40. [40]

    Memgpt: towards llms as operating systems

    Charles Packer, Vivian Fang, Shishir G Patil, Kevin Lin, Sarah Wooders, and Joseph E Gonzalez. Memgpt: towards llms as operating systems. 2023

  41. [41]

    Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering

    Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning, pages 248–260. PMLR, 2022

  42. [42]

    Generative agents: Interactive simulacra of human behavior

    Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023

  43. [43]

    Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026

    Qwen Team. Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026

  44. [44]

    Qwen3.6-Max-Preview: Smarter, sharper, still evolving, April 2026

    Qwen Team. Qwen3.6-Max-Preview: Smarter, sharper, still evolving, April 2026

  45. [45]

    Qwen3.6-Plus: Towards real world agents, April 2026

    Qwen Team. Qwen3.6-Plus: Towards real world agents, April 2026

  46. [46]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  47. [47]

    Healthcare agent: eliciting the power of large language models for medical consultation. npj Artificial Intelligence, 1(1):24, 2025

    Zhiyao Ren, Yibing Zhan, Baosheng Yu, Liang Ding, Pingbo Xu, and Dacheng Tao. Healthcare agent: eliciting the power of large language models for medical consultation. npj Artificial Intelligence, 1(1):24, 2025

  48. [48]

    Agentclinic: a multimodal agent benchmark to evaluate ai in simulated clinical environments

    Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor. Agentclinic: a multimodal agent benchmark to evaluate ai in simulated clinical environments. arXiv preprint arXiv:2405.07960, 2024

  49. [49]

    Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025

    Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025

  50. [50]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  51. [51]

    Ehragent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records

    Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce C Ho, Carl Yang, and May Dongmei Wang. Ehragent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 22315–22339, 2024

  52. [52]

    Reflexion: Language agents with verbal reinforcement learning. Advances in neural information processing systems, 36:8634–8652, 2023

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in neural information processing systems, 36:8634–8652, 2023

  53. [53]

    Welcome to the era of experience. Google AI, 1:11, 2025

    David Silver and Richard S Sutton. Welcome to the era of experience. Google AI, 1:11, 2025

  54. [54]

    Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023

    Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023

  55. [55]

    Toward expert-level medical question answering with large language models. Nature medicine, 31(3):943–950, 2025

    Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models. Nature medicine, 31(3):943–950, 2025

  56. [56]

    Learning to summarize with human feedback

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in neural information processing systems, 33:3008–3021, 2020

  57. [57]

    Dynamic cheatsheet: Test-time learning with adaptive memory

    Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, and James Zou. Dynamic cheatsheet: Test-time learning with adaptive memory. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7080–7106, 2026

  58. [58]

    Agent kb: Leveraging cross-domain experience for agentic problem solving. arXiv preprint arXiv:2507.06229, 2025

    Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, et al. Agent kb: Leveraging cross-domain experience for agentic problem solving. arXiv preprint arXiv:2507.06229, 2025

  59. [59]

    Medagents: Large language models as collaborators for zero-shot medical reasoning

    Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning. In Findings of the Association for Computational Linguistics: ACL 2024, pages 599–621, 2024

  60. [60]

    Tavily AI GitHub Organization. https://github.com/tavily-ai, 2026

    Tavily AI. Tavily AI GitHub Organization. https://github.com/tavily-ai, 2026

  61. [61]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

  62. [62]

    High-performance medicine: the convergence of human and artificial intelligence

    Eric J Topol. High-performance medicine: the convergence of human and artificial intelligence. Nature medicine, 25(1):44–56, 2019

  63. [63]

    Episodic and semantic memory. Organization of memory, 1(381-403):1, 1972

    Endel Tulving et al. Episodic and semantic memory. Organization of memory, 1(381-403):1, 1972

  64. [64]

    Voyager: An open-ended embodied agent with large language models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

  65. [65]

    Skill-sd: Skill-conditioned self-distillation for multi-turn llm agents. arXiv preprint arXiv:2604.10674, 2026

    Hao Wang, Guozhi Wang, Han Xiao, Yufeng Zhou, Yue Pan, Jichao Wang, Ke Xu, Yafei Wen, Xiaohu Ruan, Xiaoxin Chen, et al. Skill-sd: Skill-conditioned self-distillation for multi-turn llm agents. arXiv preprint arXiv:2604.10674, 2026

  66. [66]

    Augmenting language models with long-term memory. Advances in Neural Information Processing Systems, 36:74530–74543, 2023

    Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. Augmenting language models with long-term memory. Advances in Neural Information Processing Systems, 36:74530–74543, 2023

  67. [67]

    Liveclin: A live clinical benchmark without leakage. arXiv preprint arXiv:2602.16747, 2026

    Xidong Wang, Shuqi Guo, Yue Shen, Junying Chen, Jian Wang, Jinjie Gu, Ping Zhang, Lei Liu, and Benyou Wang. Liveclin: A live clinical benchmark without leakage. arXiv preprint arXiv:2602.16747, 2026

  68. [68]

    Mobile-agent-e: Self-evolving mobile assistant for complex tasks. arXiv preprint arXiv:2501.11733, 2025

    Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, and Heng Ji. Mobile-agent-e: Self-evolving mobile assistant for complex tasks. arXiv preprint arXiv:2501.11733, 2025

  69. [69]

    Medagent-pro: Towards multi-modal evidence-based medical diagnosis via reasoning agentic workflow. arXiv e-prints, pages arXiv–2503, 2025

    Ziyue Wang, Junde Wu, Chang Han Low, and Yueming Jin. Medagent-pro: Towards multi-modal evidence-based medical diagnosis via reasoning agentic workflow. arXiv e-prints, pages arXiv–2503, 2025

  70. [70]

    Agent workflow memory

    Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. arXiv preprint arXiv:2409.07429, 2024

  71. [71]

    Medco: Medical education copilots based on a multi-agent framework

    Hao Wei, Jianing Qiu, Haibao Yu, and Wu Yuan. Medco: Medical education copilots based on a multi-agent framework. In European Conference on Computer Vision, pages 119–135. Springer, 2024

  72. [72]

    Evo-memory: Benchmarking llm agent test-time learning with self-evolving memory. arXiv preprint arXiv:2511.20857, 2025

    Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H Chi, et al. Evo-memory: Benchmarking llm agent test-time learning with self-evolving memory. arXiv preprint arXiv:2511.20857, 2025

  73. [73]

    Dilu: A knowledge-driven approach to autonomous driving with large language models. arXiv preprint arXiv:2309.16292, 2023

    Licheng Wen, Daocheng Fu, Xin Li, Xinyu Cai, Tao Ma, Pinlong Cai, Min Dou, Botian Shi, Liang He, and Yu Qiao. Dilu: A knowledge-driven approach to autonomous driving with large language models. arXiv preprint arXiv:2309.16292, 2023

  74. [74]

    Drugbank 5.0: a major update to the drugbank database for 2018. Nucleic acids research, 46(D1):D1074–D1082, 2018

    David S Wishart, Yannick D Feunang, An C Guo, Elvis J Lo, Ana Marcu, Jason R Grant, Tanvir Sajed, Daniel Johnson, Carin Li, Zinat Sayeeda, et al. Drugbank 5.0: a major update to the drugbank database for 2018. Nucleic acids research, 46(D1):D1074–D1082, 2018

  75. [75]

    Evolver: Self-evolving llm agents through an experience-driven lifecycle. arXiv preprint arXiv:2510.16079, 2025

    Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, et al. Evolver: Self-evolving llm agents through an experience-driven lifecycle. arXiv preprint arXiv:2510.16079, 2025

  76. [76]

    Medjourney: Benchmark and evaluation of large language models over patient clinical journey. Advances in Neural Information Processing Systems, 37:87621–87646, 2024

    Xian Wu, Yutian Zhao, Yunyan Zhang, Jiageng Wu, Zhihong Zhu, Yingying Zhang, Yi Ouyang, Ziheng Zhang, Huimin Wang, Zhenxi Lin, et al. Medjourney: Benchmark and evaluation of large language models over patient clinical journey. Advances in Neural Information Processing Systems, 37:87621–87646, 2024

  77. [77]

    Skillrl: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234, 2026

    Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234, 2026

  78. [78]

    Improving retrieval-augmented generation in medicine with iterative follow-up questions

    Guangzhi Xiong, Qiao Jin, Xiao Wang, Minjia Zhang, Zhiyong Lu, and Aidong Zhang. Improving retrieval-augmented generation in medicine with iterative follow-up questions. In Biocomputing 2025: Proceedings of the Pacific Symposium, pages 199–214. World Scientific, 2024

  79. [79]

    A comprehensive survey of ai agents in healthcare. Journal of Biomedical Informatics, page 105045, 2026

    Gelei Xu, Xueyang Li, Yixiong Chen, Yuying Duan, Shuqing Wu, Haoxinran Yu, Ching-Hao Chiu, Juntong Ni, Ningzhi Tang, Toby Jia-Jun Li, et al. A comprehensive survey of ai agents in healthcare. Journal of Biomedical Informatics, page 105045, 2026

  80. [80]

    Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044, 2025

    Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, et al. Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044, 2025

Showing first 80 references.