arxiv: 2605.29368 · v1 · pith:7N6RBT7F · submitted 2026-05-28 · 💻 cs.CL · cs.AI

SURGENT: A Surgical Multi-Agent Assistance System Across the Perioperative Workflow

Pith reviewed 2026-06-29 08:12 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords multi-agent systems · large language models · surgical assistance · perioperative workflow · Tree-of-Thought planning · memory management · clinical retrieval

The pith

SURGENT combines Tree-of-Thought planning, multi-department agents, retrieval, and dual memory to outperform baseline LLMs on five perioperative tasks with outputs more aligned to patient histories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that standard large language models fall short for surgical work because of fixed input limits, weak memory handling, and poor traceability of reasoning. SURGENT counters this with a planner that breaks problems into thought branches, agents drawn from different hospital departments, retrieval of guidelines and papers, and a memory layer that keeps both full patient histories and concise current summaries. When tested on case analysis, surgical planning, safety checks, risk assessment, and rehabilitation advice, the system produces outputs that track patient records more closely than either plain LLMs or other medical agent setups. A reader would care because surgical decisions must draw on scattered, lengthy records while remaining explainable and consistent across the full care period. If the gains are real, the design offers a route to assistance tools that stay private and locally runnable.

Core claim

SURGENT integrates a Tree-of-Thought planner, agents from multiple clinical departments, retrieval over guidelines and literature, and a memory design that stores long-term patient histories separately from short-term working summaries. On the five tasks of case analysis, surgical plan simulation, safety monitoring, complication risk assessment, and rehabilitation guidance, this architecture yields recommendations more closely aligned with patient histories than baseline LLMs or prior medical multi-agent systems. The same components support local deployment with DeepSeek, avoiding reliance on external services.
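
As a rough illustration of what a Tree-of-Thought planner with a beam width does, the sketch below keeps only the best few partial plans at each level of the tree. It is a minimal sketch under stated assumptions, not the paper's implementation: `tot_beam_search`, `toy_expand`, `toy_score`, and the step names and values are all hypothetical stand-ins for what would be LLM calls in a real system.

```python
from typing import Callable, List, Tuple

Plan = Tuple[str, ...]  # a partial plan is an ordered tuple of thought steps


def tot_beam_search(
    root: Plan,
    expand: Callable[[Plan], List[Plan]],
    score: Callable[[Plan], float],
    beam_width: int,
    depth: int,
) -> Plan:
    """Keep the `beam_width` best partial plans at each level of the tree."""
    frontier = [root]
    for _ in range(depth):
        candidates = [child for plan in frontier for child in expand(plan)]
        if not candidates:
            break
        # Rank all children and prune to the beam before going deeper.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=score)


# Toy stand-ins for the LLM proposal and evaluation calls (hypothetical values).
STEP_VALUE = {"review history": 3.0, "check allergies": 2.0, "skip checks": -5.0}


def toy_expand(plan: Plan) -> List[Plan]:
    # Propose every step that is not already part of the plan.
    return [plan + (step,) for step in STEP_VALUE if step not in plan]


def toy_score(plan: Plan) -> float:
    # Score a plan as the sum of its step values.
    return sum(STEP_VALUE[step] for step in plan)


best = tot_beam_search((), toy_expand, toy_score, beam_width=2, depth=2)
print(best, toy_score(best))
```

A wider beam explores more candidate plans per level at the cost of more scoring calls, which is the trade-off a beam-width ablation would probe.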

What carries the argument

The argument rests on the novel memory design, which stores long-term patient histories separately from short-term working summaries so that the system can maintain both complete context and focused current reasoning.
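
The split between a complete history and a compact working view can be pictured with a small data structure. This is a hedged sketch only: the class name `DualMemory`, its methods, the `max_summary_items` budget, and the sample records are invented for illustration, and the "summary" here is simply the most recent items rather than the summarization the paper may use.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DualMemory:
    """Long-term store keeps every record; short-term store stays bounded."""

    max_summary_items: int = 3  # assumed budget, not taken from the paper
    long_term: List[str] = field(default_factory=list)
    short_term: List[str] = field(default_factory=list)

    def add_record(self, record: str) -> None:
        # The full history is append-only, so nothing is dropped from it.
        self.long_term.append(record)
        # The working view keeps only the most recent items.
        self.short_term.append(record)
        overflow = len(self.short_term) - self.max_summary_items
        if overflow > 0:
            del self.short_term[:overflow]

    def recall(self, keyword: str) -> List[str]:
        # Naive keyword lookup over the full history (a stand-in for retrieval).
        return [r for r in self.long_term if keyword.lower() in r.lower()]

    def working_context(self) -> str:
        return " | ".join(self.short_term)


# Made-up records, for illustration only.
mem = DualMemory(max_summary_items=2)
for rec in ["2019: penicillin allergy recorded",
            "2023: appendectomy",
            "2026: pre-op bloods normal"]:
    mem.add_record(rec)

print(mem.working_context())
print(mem.recall("allergy"))
```

The point of the design is visible in the usage: the oldest record has left the working context but is still recoverable from the long-term store when a later step asks for it.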

If this is right

  • Recommendations on case analysis, plan simulation, safety monitoring, risk assessment, and rehabilitation guidance track patient histories more closely.
  • Local deployment with a model such as DeepSeek removes dependence on centralized cloud services.
  • Reasoning steps become traceable through the planner, agent exchanges, and retrieved sources.
  • Collaboration across departments occurs via the specialized agents without requiring a single monolithic model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same memory split could be tested in non-surgical settings that also combine long records with immediate decisions, such as chronic disease management.
  • Whether the alignment gains reduce actual clinical errors or delays would require outcome-linked trials beyond the current task metrics.
  • Hospitals could explore feeding live electronic records directly into the long-term memory to cut manual input time.

Load-bearing premise

The measured gains on the five tasks come from the Tree-of-Thought planner, department agents, retrieval, and memory design rather than from unstated differences in prompting, task selection, or scoring methods.

What would settle it

A controlled re-run of the five tasks that keeps the same prompts, data, and metrics but removes or equalizes the planner, agents, retrieval, and memory components, and finds no advantage for SURGENT, would falsify the claim that those components drive the reported alignment improvement.
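
One way to organize such a controlled re-run is to enumerate the component configurations up front and hold everything else fixed. The sketch below is illustrative and assumes four togglable components named after the ones listed above; the function names `ablation_grid` and `leave_one_out` are hypothetical, and no evaluation harness is implied.

```python
from itertools import product

# Component names follow the review text; treating each as an on/off switch is an assumption.
COMPONENTS = ("planner", "agents", "retrieval", "memory")


def ablation_grid(components=COMPONENTS):
    """Yield every on/off combination of the components as a dict."""
    for flags in product((True, False), repeat=len(components)):
        yield dict(zip(components, flags))


def leave_one_out(components=COMPONENTS):
    """Yield the full system minus exactly one component."""
    for removed in components:
        yield {name: name != removed for name in components}


grid = list(ablation_grid())
loo = list(leave_one_out())
print(len(grid), len(loo))
```

The full grid separates interaction effects from individual contributions, while the leave-one-out set is the cheaper check when every run requires expert scoring.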

Figures

Figures reproduced from arXiv: 2605.29368 by Dongsheng Shi, Huawei Feng, Linlin Wang, Xin Yi, Yongyi Cui, Yue Li.

Figure 1. Challenges of Using a Single LLM in Analyzing …
Figure 2. Diagram of our surgical multi-agent assistance system, including task definition, planner, memory, department agents, …
Figure 3. Ablation Study of ToT. Accompanying text from the paper: "… intermediate decisions, causing slight performance degradation (e.g., 93.1% → 92.7%). A similar trend is observed for beam width. The system benefits from broader candidate exploration up to width = 5, beyond which the performance saturates or slightly declines (e.g., 93.1% → 92.9%), likely due to the inclusion of noisy or low-quality plans."
Original abstract

The intricate nature of modern surgical care necessitates intelligent systems that can synthesize extensive patient records, support collaborative decision-making, and provide transparent, auditable reasoning across the entire perioperative workflow. Although web-based Large Language Models (LLMs) possess advanced reasoning capabilities, they are ill-equipped for surgical applications due to critical limitations: input length constraints, incomplete memory management, and limited traceability. To address this issue, we present SURGENT, a surgical multi-agent assistance system that combines a Tree-of-Thought planner, multi-department collaboration agents, and retrieval-augmented reasoning with clinical guidelines and biomedical literature. SURGENT features a novel memory design that manages both long-term patient histories and short-term working summaries, enabling more complete, contextualized, and consistent reasoning. Experimental evaluations across five key perioperative tasks - case analysis, surgical plan simulation, safety monitoring, complication risk assessment, and rehabilitation guidance - show that SURGENT outperforms baseline LLMs and existing medical multi-agent frameworks, yielding recommendations more closely aligned with patient histories. Ablation studies further highlight the advantage of DeepSeek as a locally deployable backbone model, enabling privacy-preserving deployment without reliance on centralized services. These results position SURGENT as a practical and trustworthy advancement toward intelligent, equitable, and secure surgical assistance systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SURGENT, a multi-agent LLM system for perioperative surgical assistance. It combines a Tree-of-Thought planner, multi-department collaboration agents, retrieval-augmented reasoning over clinical guidelines and biomedical literature, and a novel memory architecture managing long-term patient histories alongside short-term working summaries. The central claim is that the system outperforms baseline LLMs and prior medical multi-agent frameworks across five tasks (case analysis, surgical plan simulation, safety monitoring, complication risk assessment, and rehabilitation guidance), producing recommendations more closely aligned with patient histories; ablation studies are cited to support DeepSeek as a locally deployable backbone enabling privacy-preserving use.

Significance. If supported by properly reported quantitative results, the work could contribute to practical multi-agent systems for high-stakes clinical workflows by tackling LLM limitations in context length, memory consistency, and traceability. The emphasis on local deployment and collaborative agents across surgical departments addresses real deployment constraints in medicine. The current text, however, supplies no metrics or protocols, so significance cannot yet be assessed.

major comments (2)
  1. [Abstract] The assertion that 'SURGENT outperforms baseline LLMs and existing medical multi-agent frameworks' is presented without any numerical results, datasets, statistical tests, or evaluation protocols, so the headline experimental claim cannot be evaluated.
  2. [Experimental evaluations section] No details are supplied on patient cohort size or selection, the metric used to quantify 'more closely aligned with patient histories,' baseline prompting strategies, or inter-rater/blinding procedures. It is therefore impossible to attribute any observed gains to the Tree-of-Thought planner, multi-department agents, retrieval component, or novel memory design rather than to unstated differences in prompting or case choice.
minor comments (1)
  1. [Abstract] The phrase 'yielding recommendations more closely aligned with patient histories' is imprecise without naming the alignment metric or measurement method.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful review and constructive feedback on the reporting of our experimental claims and protocols. We agree that greater transparency is needed and will revise the manuscript to address these points directly.

Point-by-point responses
  1. Referee: [Abstract] The assertion that 'SURGENT outperforms baseline LLMs and existing medical multi-agent frameworks' is presented without any numerical results, datasets, statistical tests, or evaluation protocols, so the headline experimental claim cannot be evaluated.

    Authors: We agree the abstract claim requires supporting quantitative evidence for proper evaluation. In revision we will add concise numerical highlights (e.g., task-specific accuracy or alignment improvements) drawn from the experimental results, while preserving brevity; full datasets, metrics, and protocols will remain detailed in the body.
    Revision: yes

  2. Referee: [Experimental evaluations section] No details are supplied on patient cohort size or selection, the metric used to quantify 'more closely aligned with patient histories,' baseline prompting strategies, or inter-rater/blinding procedures. It is therefore impossible to attribute any observed gains to the Tree-of-Thought planner, multi-department agents, retrieval component, or novel memory design rather than to unstated differences in prompting or case choice.

    Authors: The referee correctly notes that additional methodological specifics are required. We will expand the experimental evaluations section to report patient cohort size and selection criteria, the precise metric(s) for history alignment, baseline prompting details, and evaluation procedures including inter-rater reliability and blinding. These additions will enable clearer attribution of gains to the individual system components.
    Revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain; empirical system paper with no reductions to self-defined inputs

Full rationale

The paper describes a multi-agent surgical assistance system and reports experimental outperformance on five perioperative tasks. No equations, derivations, fitted parameters, or predictions appear in the provided text. The central claim rests on empirical comparisons rather than any self-definitional, fitted-input, or self-citation reduction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in a way that collapses the result to its own inputs by construction. This matches the default case of a non-circular empirical systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are stated or derivable from the provided text.

pith-pipeline@v0.9.1-grok · 5767 in / 1046 out tokens · 25647 ms · 2026-06-29T08:12:45.717636+00:00 · methodology
