A Multi-Agent LLM Framework for Rating the Quality of Surgical Feedback

Andrew J. Hung; Anima Anandkumar; Atharva Deo; Cherine H. Yang; Jasmine Lin; J. Everett Knudsen; Peter Wager; Rafal Kocielnik; Steven Y. Cen; Ujjwal Pasupulety

arXiv: 2605.25440 · v1 · pith:46JN6DATnew · submitted 2026-05-25 · cs.CL · cs.AI · cs.MA


Pith reviewed 2026-06-29 22:24 UTC · model grok-4.3

classification: cs.CL · cs.AI · cs.MA

keywords: surgical feedback · multi-agent LLM · feedback quality criteria · LLM-as-a-judge · trainee behavioral change · surgical training · interpretable criteria


The pith

Multi-agent LLMs discover a small set of interpretable criteria that rate surgical feedback quality and predict its effectiveness better than prior content-based frameworks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a two-stage framework that uses multi-agent LLM prompting with surgical domain knowledge to automatically discover a compact set of human-interpretable quality criteria for verbal feedback in the operating room. These criteria, such as Encouraging, Urgent, and Clear, capture delivery aspects that earlier taxonomies overlooked. The criteria are then applied through an LLM-as-a-judge step to score individual feedback instances automatically. Evaluation on 4.2k real trainer feedback instances shows that the discovered criteria predict feedback effectiveness, measured as observed trainee behavioral adjustments and trainer approval, better than prior content-based frameworks. This enables scalable assessment of communication quality without relying on extensive expert annotation.

Core claim

The central claim is that multi-agent LLM prompting combined with surgical domain knowledge injection produces a small set of interpretable scoring criteria which, when used for automated evaluation, outperform prior content-based frameworks at predicting feedback effectiveness as measured by trainee behavioral adjustments and trainer approval.

What carries the argument

Two-stage LLM framework: multi-agent prompting to discover criteria grounded in surgical training, followed by LLM-as-a-judge scoring with those criteria.
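The two stages can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `LLM` is a hypothetical callable standing in for any chat-completion client, the prompts are invented placeholders rather than the paper's templates, the agent roles are made up, and merging proposals by case-insensitive name is a simplification of whatever consolidation step the paper actually uses.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical interface: a function that takes a prompt and returns text.
LLM = Callable[[str], str]


def discover_criteria(llm: LLM, agent_roles: List[str], domain_notes: str,
                      examples: List[str]) -> List[str]:
    """Stage 1: each agent proposes criteria; proposals are merged by name."""
    seen, merged = set(), []
    for role in agent_roles:
        prompt = (
            f"You are a {role}.\n"
            f"Domain knowledge: {domain_notes}\n"
            "Feedback examples:\n" + "\n".join(examples) + "\n"
            "List short quality criteria, one per line."
        )
        for line in llm(prompt).splitlines():
            name = line.strip(" -*\t").capitalize()
            if name and name.lower() not in seen:
                seen.add(name.lower())
                merged.append(name)
    return merged


def score_feedback(llm: LLM, criteria: List[str], feedback: str,
                   lo: int = 1, hi: int = 5) -> Dict[str, Optional[int]]:
    """Stage 2: ask the judge for one integer per criterion, clamped to [lo, hi]."""
    scores: Dict[str, Optional[int]] = {}
    for criterion in criteria:
        prompt = (
            f"Rate the FEEDBACK on '{criterion}' from {lo} to {hi}. "
            f"Reply with a single integer.\nFEEDBACK: {feedback}"
        )
        try:
            scores[criterion] = min(hi, max(lo, int(llm(prompt).strip())))
        except ValueError:
            scores[criterion] = None  # unparseable reply; left for manual review
    return scores


def fake_llm(prompt: str) -> str:
    """Canned replies so the sketch runs without any API access."""
    if "List short quality criteria" in prompt:
        if "attending surgeon" in prompt:
            return "- clear\n- urgent"
        return "- Clear\n- encouraging"
    return "4"


criteria = discover_criteria(
    fake_llm,
    agent_roles=["attending surgeon", "education researcher"],
    domain_notes="Feedback aims to change trainee behavior during live surgery.",
    examples=["Nice, keep going.", "Stop, stop, pull back now."],
)
scores = score_feedback(fake_llm, criteria, "Stop, stop, pull back now.")
```

Replacing `fake_llm` with a real client is the only change needed to try this on actual transcripts. Returning `None` for unparseable replies keeps judge failures visible instead of silently imputing a score.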

If this is right

Feedback quality assessment can be performed automatically at scale in live surgical environments.
Training programs obtain consistent, nuanced metrics for evaluating and refining trainer communication.
Delivery features such as urgency and clarity receive explicit weight alongside content categories.
The method reduces dependence on manual expert annotation for large-scale feedback review.
The same discovery process supplies a reusable template for rating communication in other training contexts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could extend to feedback evaluation in other apprenticeship settings such as aviation or procedural medicine.
Real-time scoring during surgery might support immediate adjustments to how trainers phrase their comments.
Pairing the criteria with video-tracked trainee actions could test whether higher-rated feedback produces measurable skill gains over time.

Load-bearing premise

The criteria produced by the multi-agent LLM process will remain predictive of trainee behavioral changes and trainer approval when tested on new surgical feedback data.

What would settle it

A fresh collection of feedback instances in which scores from the discovered criteria show no statistically significant correlation with observed trainee adjustments or trainer approval.
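A check of this kind could be run as a permutation test on the correlation between one criterion's scores and a binary outcome. The sketch below is an assumption about how such a test might look, not the paper's statistical procedure: the data are invented, the outcome is coded 0/1, and Pearson correlation (equivalent to point-biserial here) is one choice among several.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation; with a 0/1 outcome this is the point-biserial r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def permutation_p_value(scores, outcomes, n_perm=2000, seed=0):
    """Two-sided p-value: how often shuffled outcomes match or beat |r|."""
    rng = random.Random(seed)
    observed = abs(pearson(scores, outcomes))
    shuffled = list(outcomes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(scores, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Invented example: judge scores for one criterion and whether the
# trainee adjusted behavior (1) or not (0) after each feedback instance.
scores = [1, 1, 2, 2, 3, 4, 4, 5, 5, 5]
outcomes = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
p_value = permutation_p_value(scores, outcomes)
```

On fresh data, a p-value that stays above the chosen threshold for every discovered criterion would be the falsifying result described above. With several criteria tested at once, a multiple-comparison correction would also be needed.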

Original abstract

Verbal feedback delivered by attending surgeons in the operating room plays a critical formative role in resident trainee skill acquisition. Yet, assessing the quality of trainer feedback and its effectiveness in influencing trainee behavior during live surgery remains a challenge. Prior studies assessed feedback content relying on extensive manual annotation by expert human raters and focused on developing broad taxonomies that overlook the qualitative aspects of feedback delivery such as clarity or urgency. Limited existing automated methods, including keyword analysis and topic modeling, also fail to capture these nuanced aspects. We introduce a two-stage LLM-based framework that discovers interpretable feedback quality criteria grounded in the context of surgical training. Our method uses multi-agent prompting and surgical domain knowledge injection to discover a small set of human interpretable scoring criteria (e.g., Encouraging, Urgent, Clear). These criteria are then used to automatically score live surgical feedback via an LLM-as-a-judge approach. Evaluation on 4.2k trainer feedback instances demonstrates that our AI-discovered criteria outperform prior content-based frameworks in predicting feedback effectiveness, including observed trainee behavioral adjustments and trainer approval. This work advances scalable, human-aligned assessment of communication quality in the operating room and provides a foundation for improving surgical teaching practices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The two-stage multi-agent LLM pipeline for discovering surgical feedback criteria is a straightforward extension of existing LLM-as-judge methods, but the predictive claims rest on an unspecified protocol for measuring trainee behavioral adjustments.


The core contribution is a two-stage multi-agent LLM framework that first uses prompting and surgical domain knowledge to surface a small set of interpretable criteria such as Encouraging, Urgent, and Clear, then applies an LLM judge to score 4.2k feedback instances. The authors report that these criteria beat prior content-based frameworks at predicting feedback effectiveness, measured by observed trainee behavioral adjustments and trainer approval. Keyword analysis and topic modeling are cited as existing automated methods that miss delivery aspects such as clarity or urgency.

The work does a reasonable job framing the practical problem in surgical training and moving past broad manual taxonomies toward automated, context-grounded scoring. The multi-agent step for criteria discovery adds a layer beyond simple prompting and tries to keep the outputs human-readable, which is useful for an applied medical education setting.

The soft spots are concentrated in the evaluation. The abstract gives no protocol for recording trainee behavioral adjustments, no inter-rater reliability numbers, no observation time window, and no controls for case difficulty or trainee level. Without an independent, reproducible definition of that outcome, the reported outperformance cannot be read as strong evidence that the criteria capture real effectiveness rather than confounders. The abstract also omits any statistical methods, sample details, or checks for LLM judge biases, so soundness is difficult to judge.

This paper is mainly for researchers working on automated assessment tools in surgical education or on multi-agent LLM pipelines for domain-specific evaluation. A reader in those niches could extract the criteria-discovery technique as a starting point.

I would send it for peer review because the application is concrete and the method is a clear attempt to improve on the cited baselines, but only if the full version supplies a clear measurement protocol and basic validation stats for the outcomes.

Referee Report

2 major / 1 minor

Summary. The paper introduces a two-stage multi-agent LLM framework that first discovers a small set of human-interpretable surgical feedback quality criteria (e.g., Encouraging, Urgent, Clear) via domain-knowledge injection, then applies an LLM-as-a-judge to score 4.2k live trainer feedback instances. It claims these AI-derived criteria outperform prior content-based frameworks at predicting feedback effectiveness, operationalized as observed trainee behavioral adjustments and trainer approval.

Significance. If the evaluation design and outcome measurement prove robust, the work would offer a scalable, interpretable alternative to manual expert annotation for assessing communication quality in surgical training, with potential downstream uses in feedback improvement and resident skill acquisition.

major comments (2)

[Evaluation / Results] Evaluation section (results on 4.2k instances): the central claim that the discovered criteria 'outperform prior content-based frameworks in predicting feedback effectiveness, including observed trainee behavioral adjustments' cannot be assessed because the measurement protocol for trainee behavioral adjustments is entirely unspecified—no details on video review procedure, post-feedback time window, inter-rater reliability, controls for case difficulty or trainee level, or how adjustments were distinguished from baseline behavior.
[Methods] Methods (LLM-as-a-judge stage): no validation is reported of the LLM judge's scores against independent human raters on the new criteria set, leaving open whether outperformance reflects genuine predictive power or LLM-specific biases that may correlate with the (unspecified) outcome measures.

minor comments (1)

[Abstract / Introduction] Abstract and introduction: the phrase 'surgical domain knowledge injection' is used without a concrete description of the injection mechanism or source material until later sections; a brief forward reference would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the evaluation design and methods. We address each major point below and will revise the manuscript to provide the requested details and validation.


Referee: [Evaluation / Results] Evaluation section (results on 4.2k instances): the central claim that the discovered criteria 'outperform prior content-based frameworks in predicting feedback effectiveness, including observed trainee behavioral adjustments' cannot be assessed because the measurement protocol for trainee behavioral adjustments is entirely unspecified—no details on video review procedure, post-feedback time window, inter-rater reliability, controls for case difficulty or trainee level, or how adjustments were distinguished from baseline behavior.

Authors: We agree that the measurement protocol for trainee behavioral adjustments was not described in sufficient detail. In the revised manuscript we will add a dedicated paragraph in the Evaluation section specifying the video review procedure, the post-feedback observation window, inter-rater reliability coefficients, controls for case difficulty and trainee level, and the operational criteria used to identify behavioral adjustments distinct from baseline performance.
Revision: yes

Referee: [Methods] Methods (LLM-as-a-judge stage): no validation is reported of the LLM judge's scores against independent human raters on the new criteria set, leaving open whether outperformance reflects genuine predictive power or LLM-specific biases that may correlate with the (unspecified) outcome measures.

Authors: We acknowledge that a direct comparison of the LLM-as-a-judge scores against independent human ratings on the discovered criteria was not reported. We will add a human validation subsection reporting agreement metrics (e.g., Cohen’s kappa) between the LLM scores and ratings provided by surgical experts on a held-out sample of instances, thereby addressing potential bias concerns.
Revision: yes
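The agreement metric named in this response is simple to compute. The sketch below implements unweighted Cohen's kappa from its definition, kappa = (p_o - p_e) / (1 - p_e), using only the standard library. The two rating lists are invented for illustration and do not come from the paper.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = count_a.keys() | count_b.keys()
    p_e = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Invented example: 1 = "Clear", 0 = "not Clear" on four held-out instances.
llm_labels = [1, 1, 0, 0]
expert_labels = [1, 0, 0, 0]
kappa = cohen_kappa(llm_labels, expert_labels)
```

For ordinal ratings such as a 1 to 5 scale, a weighted kappa that gives partial credit to near-misses may be more appropriate. scikit-learn's `cohen_kappa_score` supports this through its `weights` argument.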

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper's central derivation uses multi-agent LLM prompting plus domain knowledge to discover a small set of scoring criteria (e.g., Encouraging, Urgent, Clear), then applies an LLM-as-a-judge to score 4.2k feedback instances, and finally correlates those scores against separately observed external outcomes (trainee behavioral adjustments and trainer approval). These outcome variables are described as independent observations rather than quantities derived from the LLM criteria themselves. No self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation chain is present in the abstract or method outline; the empirical outperformance claim rests on external validation metrics.
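One way to make the comparison against external outcomes concrete is the area under the ROC curve (AUROC), computed separately for the discovered-criteria score and for a baseline feature against the same independently observed outcome. The abstract does not state which metric the paper uses, so AUROC is an assumption here, and all values below are invented.

```python
def auroc(scores, labels):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented example: 1 = trainee adjusted behavior, 0 = no adjustment.
adjusted = [1, 1, 1, 0, 0, 0]
criteria_score = [5, 4, 4, 3, 2, 4]   # hypothetical LLM-judge score
baseline_flag = [1, 0, 1, 1, 0, 0]    # hypothetical content-category feature

auc_criteria = auroc(criteria_score, adjusted)
auc_baseline = auroc(baseline_flag, adjusted)
```

The pairwise loop is quadratic in the number of instances, which is acceptable for a few thousand items. A rank-based formula or a library routine would be preferable at larger scale. A difference between two AUROC values also needs a significance test before it supports an outperformance claim.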

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the unverified assumption that LLMs can reliably extract and apply human-aligned quality criteria in the surgical domain when given domain knowledge prompts.

axioms (1)

Domain assumption: Multi-agent LLM prompting with injected surgical domain knowledge produces a small set of human-interpretable scoring criteria that generalize to live feedback.
Invoked to justify the first stage of the framework, which generates criteria such as Encouraging, Urgent, and Clear.

pith-pipeline@v0.9.1-grok · 5785 in / 1200 out tokens · 34349 ms · 2026-06-29T22:24:46.206211+00:00 · methodology


[1] [1]

Okay, I see

Agha, R. A., Fowler, A. J. & Sevdalis, N. The role of non-technical skills in surgery.Annals of medicine and surgery4, 422–427 (2015). 22 Prompt Template for Quality-Criteria Discovery System Instruction: You are working in the context ofverbal feedbackdelivered by a trainer to a trainee in a live surgery. The goal of the feedback is to modify trainee thi...

2015

[2] [2]

M., Dedy, N

Bonrath, E. M., Dedy, N. J., Gordon, L. E. & Grantcharov, T. P. Comprehensive surgical coaching enhances surgical skill in the operating room.Annals of surgery 262, 205–212 (2015)

2015

[3] [3]

[feedback line]

Ma, R.et al.Tailored feedback based on clinically relevant performance met- rics expedites the acquisition of robotic suturing skills—an unblinded pilot 23 Prompt Template for Multi-Criteria Feedback Scoring System Instruction: This is verbal FEEDBACK delivered during surgery by a trainer to a trainee. Please rate it given each of the following criteria a...

2022

[4] [4]

M.et al.The surgical autonomy program: a pilot study of social learning theory applied to competency-based neurosurgical education

Haglund, M. M.et al.The surgical autonomy program: a pilot study of social learning theory applied to competency-based neurosurgical education. Neurosurgery88, E345–E350 (2021)

2021

[5] [5]

S., Wanzek, J

Hauge, L. S., Wanzek, J. A. & Godellas, C. The reliability of an instrument for identifying and quantifying surgeons’ teaching in the operating room.The American journal of surgery181, 333–337 (2001)

2001

[6] [6]

Blom, E.et al.Analysis of verbal communication during teaching in the operating room and the potentials for surgical training.Surgical endoscopy21, 1560–1566 (2007)

2007

[7] [7]

D., Ruis, A

D’Angelo, A.-L. D., Ruis, A. R., Collier, W., Shaffer, D. W. & Pugh, C. M. Evaluating how residents talk and what it means for surgical performance in the simulation lab.The American Journal of Surgery220, 37–43 (2020)

2020

[8] [8]

Y.et al.Development of a classification system for live surgical feedback.JAMA Network Open6, e2320702–e2320702 (2023)

Wong, E. Y.et al.Development of a classification system for live surgical feedback.JAMA Network Open6, e2320702–e2320702 (2023)

2023

[9] [9]

Ramprasad, A.et al.Language in the teaching operating room: expressing confidence versus community.Journal of Surgical Education81, 556–563 (2024)

2024

[10] [10]

Kocielnik, R.et al.Human ai collaboration for unsupervised categorization of live surgical feedback.npj Digital Medicine7, 372 (2024)

2024

[11] [11]

BERTopic: Neural topic modeling with a class-based TF-IDF procedure

Grootendorst, M. Bertopic: Neural topic modeling with a class-based tf-idf procedure.arXiv preprint arXiv:2203.05794(2022). 24

work page internal anchor Pith review Pith/arXiv arXiv 2022

[12] [12]

URL https://papers.nips.cc/paper files/paper/2023/hash/ 91f18a1287b398d378ef22505bf41832-Abstract-Datasets and Benchmarks.html

Zheng, L.et al.Judging llm-as-a-judge with mt-bench and chat- bot arena.Advances in Neural Information Processing Systems36, 46595–46623 (2023). URL https://papers.nips.cc/paper files/paper/2023/hash/ 91f18a1287b398d378ef22505bf41832-Abstract-Datasets and Benchmarks.html

2023

[13] [13]

Landis, J. R. & Koch, G. G. The measurement of observer agreement for categorical data.biometrics159–174 (1977)

1977

[14] [14]

McHugh, M. L. Interrater reliability: the kappa statistic.Biochemia Med- ica22, 276–282 (2012). URL https://www.ncbi.nlm.nih.gov/pmc/articles/ PMC3900052/. PMID: 23092060

2012

[15] [15]

P., Calkins, C

Quesada, S. P., Calkins, C. & Jeglic, E. L. An examination of the interrater reliability between practitioners and researchers on the static-99.Interna- tional Journal of Offender Therapy and Comparative Criminology58, 1364–1375 (2014)

2014

[16] [16]

Holland, J. R.et al.Reliability of the behaviorally anchored rating scale (bars) for assessing non-technical skills of medical students in simulated scenarios.Medical Education Online27, 2070940 (2022)

2022

[17] [17]

& Blair, R

Liu, T., Yu, H. & Blair, R. H. Stability estimation for unsupervised clustering: A review.Wiley Interdisciplinary Reviews: Computational Statistics14, e1575 (2022)

2022

[18] Tausczik, Y. R. & Pennebaker, J. W. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of language and social psychology 29, 24–54 (2010).

[19] Gu, J. et al. A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594 (2024).

[20] Patel, D. et al. Exploring temperature effects on large language models across various clinical tasks. medRxiv 2024–07 (2024).

[21] Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. & Blei, D. Reading tea leaves: How humans interpret topic models. Advances in neural information processing systems 22 (2009).

[22] Lipton, Z. C. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue 16, 31–57 (2018).

[23] Ribeiro, M. T., Singh, S. & Guestrin, C. "Why should I trust you?" Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining 1135–1144 (2016).

[24] Greenberg, C. C. et al. Association of a statewide surgical coaching program with clinical outcomes and surgeon perceptions. Annals of surgery 273, 1034–1039 (2021).

[25] Freschi, C. et al. Technical review of the da Vinci surgical telemanipulator. The International Journal of Medical Robotics and Computer Assisted Surgery 9, 396–406 (2013).

[26] Schwab, D. P., Heneman III, H. & DeCotiis, T. A. Behaviorally anchored rating scales: A review of the literature. Academy of Management Proceedings 1975, 222–224 (1975).

[27] Jacobs, R., Kafry, D. & Zedeck, S. Expectations of behaviorally anchored rating scales. Personnel psychology 33, 595–640 (1980).

[28] Van Hove, P., Tuijthof, G., Verdaasdonk, E., Stassen, L. & Dankelman, J. Objective assessment of technical surgical skills. Journal of British Surgery 97, 972–987 (2010).

[29] Haque, T. F. et al. An assessment tool to provide targeted feedback to robotic surgical trainees: development and validation of the end-to-end assessment of suturing expertise (EASE). Urology practice 9, 532–539 (2022).

[30] Vanstrum, E. B. et al. Development and validation of an objective scoring tool to evaluate surgical dissection: dissection assessment for robotic technique (DART). Urology practice 8, 596–604 (2021).

[31] Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. & Iwasawa, Y. Large language models are zero-shot reasoners. Advances in neural information processing systems 35, 22199–22213 (2022).

[32] Ozturkler, B., Malkin, N., Wang, Z. & Jojic, N. ThinkSum: Probabilistic reasoning over sets using large language models. arXiv preprint arXiv:2210.01293 (2022).

[33] Wang, X. et al. Rationale-augmented ensembles in language models. arXiv preprint arXiv:2207.00747 (2022).

[34] Jiang, K., Mujtaba, M. M. & Bernard, G. R. Large language model as unsupervised health information retriever. Caring is Sharing–Exploiting the Value in Data for Health and Innovation 833–834 (2023).

[35] Wei, J. et al. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652 (2021).

[36] Maharjan, J. et al. OpenMedLM: prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models. Scientific Reports 14, 14156 (2024).

[37] Sivarajkumar, S., Kelley, M., Samolyk-Mazzanti, A., Visweswaran, S. & Wang, Y. An empirical evaluation of prompting strategies for large language models in zero-shot clinical natural language processing: algorithm development and validation study. JMIR Medical Informatics 12, e55318 (2024).

[38] Windisch, P. et al. The impact of temperature on extracting information from clinical trial publications using large language models. Cureus 16 (2024).

[39] Anderson, B. R., Shah, J. H. & Kreminski, M. Homogenization effects of large language models on human creative ideation. Proceedings of the 16th conference on creativity & cognition 413–425 (2024).

[40] Rousseeuw, P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics 20, 53–65 (1987).

[41] SBERT.net. sentence-transformers/all-MiniLM-L12-v2 · Hugging Face. https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2. (Accessed on 03/24/2024)

[42] Mishra, A. R., Panchal, V. & Kumar, P. Similarity search based on text embedding model for detection of near duplicates. International Journal of Grid and Distributed Computing 13, 1871–1881 (2020).

[43] Rodier, S. & Carter, D. Online near-duplicate detection of news articles. Proceedings of the Twelfth Language Resources and Evaluation Conference 1242–1249 (2020).

[44] Tumre, S., Patil, S. & Kumar, A. Improved near-duplicate detection for aggregated and paywalled news-feeds. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track) 979–987 (2025).

[45] Zhao, K. et al. X-ray made simple: Lay radiology report generation and robust evaluation. arXiv preprint arXiv:2406.17911 (2024).

[46] Li, D. et al. From generation to judgment: Opportunities and challenges of LLM-as-a-judge. arXiv preprint arXiv:2411.16594 (2024).

[47] Schroeder, K. & Wood-Doughty, Z. Can you trust LLM judgments? Reliability of LLM-as-a-judge. arXiv preprint arXiv:2412.12509 (2024).

[48] Pan, Q. et al. Human-centered design recommendations for LLM-as-a-judge. Proceedings of the 1st Human-Centered Large Language Modeling Workshop 16–29 (2024).

[49] Mosca, E., Szigeti, F., Tragianni, S., Gallagher, D. & Groh, G. SHAP-based explanation methods: a review for NLP interpretability. Proceedings of the 29th international conference on computational linguistics 4593–4603 (2022).

[50] King, G. & Zeng, L. Logistic regression in rare events data. Political analysis 9, 137–163 (2001).

[51] Sun, X. & Xu, W. Fast implementation of DeLong’s algorithm for comparing the areas under correlated receiver operating characteristic curves. IEEE Signal Processing Letters 21, 1389–1393 (2014).

[52] Campbell, J. L., Quincy, C., Osserman, J. & Pedersen, O. K. Coding in-depth semistructured interviews: Problems of unitization and intercoder reliability and agreement. Sociological methods & research 42, 294–320 (2013).

[53] Chinh, B., Zade, H., Ganji, A. & Aragon, C. Ways of qualitative coding: A case study of four strategies for resolving disagreements. Extended abstracts of the 2019 CHI conference on human factors in computing systems 1–6 (2019).

[54] Dong, Q. et al. A survey on in-context learning. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing 1107–1128 (2024).

[55] Watkins, S. C., Roberts, D. A., Boulet, J. R., McEvoy, M. D. & Weinger, M. B. Evaluation of a simpler tool to assess nontechnical skills during simulated critical events. Simulation in Healthcare 12, 69–75 (2017).

[56] Viera, A. J., Garrett, J. M. et al. Understanding interobserver agreement: the kappa statistic. Fam Med 37, 360–363 (2005).

[57] Cohen, J. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological bulletin 70, 213 (1968).

[58] OpenAI. OpenAI API (2023). URL https://platform.openai.com/docs/introduction. Online; accessed 07-Aug-2025.

[59] Scikit-learn. AgglomerativeClustering — scikit-learn 1.7.1 documentation (2023). URL https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html. [Online; accessed 2025-08-07].

[60] Scikit-learn. RandomForestClassifier — scikit-learn 1.7.1 documentation (2023). URL https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html. [Online; accessed 2025-08-07].

[61] Scikit-learn. GridSearchCV — scikit-learn 1.7.1 documentation (2023). URL https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html. [Online; accessed 2025-08-07].

[62] Scikit-learn. StratifiedKFold — scikit-learn 1.7.1 documentation (2023). URL https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html. [Online; accessed 2025-08-07].

[63] Scikit-learn. cohen_kappa_score — scikit-learn 1.7.1 documentation (2023). URL https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html. [Online; accessed 2025-08-07].