RecallRisk-BERT: A Multi-Task Framework for Post-Report Medical Device Recall Triage

Ali Semih Atalay; Sevgi Yigit-Sert

arxiv: 2606.27174 · v1 · pith:G67LVIRJnew · submitted 2026-06-25 · 💻 cs.LG

Pith reviewed 2026-06-26 05:01 UTC · model grok-4.3

keywords: medical device recalls · multi-task learning · BERT · FDA · recall triage · severity prediction · root cause analysis

The pith

A multi-task BERT model on FDA recall records predicts both severity class and root-cause category at once.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds RecallRisk-BERT to handle the growing volume of FDA medical device recalls by processing both narrative text and categorical fields to classify severity into Class I/II/III and one of nine consolidated root-cause categories. It first benchmarks classical models and then shows that the joint training approach beats a single-task PubMedBERT baseline on the 54,165-record dataset. The model also produces risk rankings that align closely with how severe the root causes prove to be in practice. This setup is presented as a way to automate initial triage after a recall is reported.

Core claim

RecallRisk-BERT, which fuses PubMedBERT text embeddings with embeddings of product code, regulation number, and medical specialty, substantially outperforms the single-task PubMedBERT baseline in the multi-task setting while producing risk rankings that correlate at rho = 0.983 with observed root-cause severity patterns.

What carries the argument

RecallRisk-BERT multi-task architecture that merges PubMedBERT textual representations of recall narratives with embedding-based representations of structured categorical features to predict severity and root cause simultaneously.
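
A minimal sketch of the fusion idea, not the authors' implementation: a text vector (here a random stand-in for the PubMedBERT output) is concatenated with one learned embedding per categorical field, and two linear heads share the fused vector. The dimensions, vocabulary sizes, the `FusionSketch` class, and the `alpha`-weighted sum of the two task losses are all assumptions made for illustration; the abstract does not say how the losses are combined.

```python
import math
import random

random.seed(0)

TEXT_DIM = 8       # stand-in for the PubMedBERT pooled vector (assumed size)
EMB_DIM = 4        # hypothetical size of each categorical embedding
N_SEVERITY = 3     # Class I / II / III
N_ROOT_CAUSE = 9   # consolidated root-cause categories


def rand_matrix(rows, cols):
    """Small random weights; a real model would learn these."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weights, x):
    """Matrix-vector product without bias."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class FusionSketch:
    """Hypothetical text + categorical fusion with two task heads."""

    def __init__(self, vocab_sizes):
        # One embedding table per categorical field.
        self.tables = {name: rand_matrix(size, EMB_DIM)
                       for name, size in vocab_sizes.items()}
        fused_dim = TEXT_DIM + EMB_DIM * len(vocab_sizes)
        self.severity_head = rand_matrix(N_SEVERITY, fused_dim)
        self.root_cause_head = rand_matrix(N_ROOT_CAUSE, fused_dim)

    def forward(self, text_vec, cat_ids):
        fused = list(text_vec)
        for name, table in self.tables.items():
            fused.extend(table[cat_ids[name]])  # look up and concatenate
        return (softmax(linear(self.severity_head, fused)),
                softmax(linear(self.root_cause_head, fused)))


def joint_loss(sev_probs, rc_probs, sev_label, rc_label, alpha=0.5):
    """Assumed multi-task objective: weighted sum of two cross-entropies."""
    return -(alpha * math.log(sev_probs[sev_label])
             + (1 - alpha) * math.log(rc_probs[rc_label]))


model = FusionSketch({"product_code": 50, "regulation_number": 30,
                      "medical_specialty": 20})
text_vec = [random.uniform(-1, 1) for _ in range(TEXT_DIM)]
sev, rc = model.forward(text_vec, {"product_code": 7, "regulation_number": 3,
                                   "medical_specialty": 11})
loss = joint_loss(sev, rc, sev_label=1, rc_label=4)
```

Both heads read the same fused vector, so a gradient from either task would update the shared representation; that sharing is what distinguishes this setup from two separately trained single-task models.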

If this is right

Text-tabular learning can support scalable post-report recall triage.
Model-derived risk rankings enable model-based root-cause risk analysis.
Automated severity assessment can provide regulatory decision support.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fusion of text and structured fields could be tested on recall data from other countries to check transfer.
If the correlation holds on future data, the model could be used to flag high-risk recalls for faster manual review.
Adding more outcome variables such as patient harm counts might be feasible within the same multi-task setup.

Load-bearing premise

The narrative text and categorical fields in the FDA recall records contain enough signal to classify severity and root cause accurately without substantial label noise or selection bias.

What would settle it

Testing the trained model on recall records filed after October 2025 and checking whether accuracy, F1, AUC, and the 0.983 correlation remain at the reported levels.
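
To make the correlation check concrete, here is a small sketch of Spearman's rho computed from scratch. The two score lists are invented for illustration and are not the paper's data. It also assumes the correlation is taken over the nine root-cause categories, which the abstract does not state explicitly.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for nine categories: model-derived risk score and
# observed share of Class I recalls. The 4th and 5th entries swap order.
model_risk = [0.91, 0.84, 0.77, 0.63, 0.58, 0.41, 0.33, 0.20, 0.12]
observed_share = [0.30, 0.26, 0.21, 0.15, 0.17, 0.09, 0.06, 0.03, 0.01]

rho = spearman_rho(model_risk, observed_share)
print(round(rho, 3))
```

For nine untied items, rho = 1 - 6·Σd²/(n(n² - 1)) = 1 - 12/720 ≈ 0.983 when the two rankings differ by exactly one adjacent swap. Under the nine-category assumption, a rho of 0.983 would therefore correspond to near-identical orderings of a small set, so re-checking it on later data amounts to asking whether that ordering stays stable.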

Figures

Figures reproduced from arXiv: 2606.27174 by Ali Semih Atalay, Sevgi Yigit-Sert.

**Figure 2.** Proposed RecallRisk-BERT architecture integrating PubMedBERT-based tex…

Original abstract

Medical device recalls are a critical regulatory mechanism for protecting patient safety. The growing volume of FDA recall records presents challenges in post-report recall triage, severity assessment, and root-cause interpretation. Existing studies mostly address recall occurrence prediction or root-cause analysis separately, while joint modeling of recall severity and root-cause categories has received limited attention. We develop an automated recall triage framework using 54,165 FDA medical device recall records from openFDA, covering the period from 2002 to October 2025. We first evaluate classical machine learning and boosting-based models for recall severity and root-cause category prediction. We then develop RecallRisk-BERT, a multi-task model that combines PubMedBERT-based textual representations of recall narratives with embedding-based representations of structured categorical features, including product code, regulation number, and medical specialty. The model simultaneously predicts recall severity (Class I/II/III) and a consolidated root-cause category (9 classes). Performance was evaluated using accuracy, macro-averaged precision, recall, F1-score, and ROC-AUC. In single-task severity prediction, our LightGBM-based text–tabular configuration achieved the strongest performance, with an accuracy of 0.963, macro-F1 of 0.856, and ROC-AUC of 0.974. In the multi-task setting, RecallRisk-BERT substantially outperformed the single-task PubMedBERT baseline. Model-derived risk rankings were strongly consistent with observed root-cause severity patterns (rho = 0.983, p = 1.936e-6). These findings indicate that text–tabular learning can support scalable post-report recall triage, regulatory decision support, and model-based root-cause risk analysis.
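
The gap between the reported accuracy (0.963) and macro-F1 (0.856) is the pattern one would expect under class imbalance, because macro-averaging gives each class equal weight regardless of size. The sketch below illustrates that effect on an invented 100-record example. The class counts and predictions are hypothetical and say nothing about the paper's actual label distribution.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced sample: 90 Class II, 6 Class I, 4 Class III.
y_true = ["II"] * 90 + ["I"] * 6 + ["III"] * 4
# The toy classifier gets every Class II right but sends half of the
# minority-class records to Class II.
y_pred = ["II"] * 90 + ["I"] * 3 + ["II"] * 3 + ["III"] * 2 + ["II"] * 2

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, ["I", "II", "III"])
print(round(accuracy, 3), round(score, 3))
```

In this toy case accuracy is 0.95 while macro-F1 is about 0.769, since the two minority classes each score an F1 of 2/3 and count as much as the majority class in the average.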

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The multi-task RecallRisk-BERT beats the single-task PubMedBERT baseline on FDA recall data and shows a near-perfect correlation with observed patterns, but the abstract gives almost no information on splits, imbalance handling, or label validation.

The headline result is that joint training on severity and root-cause labels from 54k openFDA records lifts performance over a single-task PubMedBERT baseline, and the model-derived rankings line up closely with the observed severity patterns (rho 0.983). That joint setup is the main new piece relative to prior separate modeling work.

The paper does a few things cleanly. It pulls a real regulatory dataset, fuses narrative text with categorical fields like product code and medical specialty, and reports the usual accuracy, macro-F1, and AUC numbers. LightGBM on the text-tabular features already reaches 0.963 accuracy on severity, which is a reasonable baseline to beat.

The soft spots are mostly about missing verification steps. The abstract does not describe the train-test split, whether the rho was computed on held-out data, how class imbalance was handled, or how the nine root-cause classes were consolidated. If the labels carry the usual regulatory noise or selection effects, both the multi-task lift and the high correlation could shrink once those details are checked. The stress-test concern about label quality is fair until the full methods section shows otherwise.

This is a narrow but practical piece aimed at people who build tools for FDA-style triage or health informatics groups that work with regulatory text. It is not reshaping general NLP, but the setup is concrete enough that a serious referee could usefully push on the experimental controls and label reliability.

I would send it to review. The core idea is straightforward and the data source is public, so the authors can add the missing checks without starting over.

Referee Report

2 major / 2 minor

Summary. The paper introduces RecallRisk-BERT, a multi-task framework that fuses PubMedBERT embeddings of recall narratives with embeddings of categorical fields (product code, regulation number, medical specialty) from 54,165 openFDA records (2002–Oct 2025) to jointly predict FDA recall severity (Class I/II/III) and a consolidated 9-class root-cause taxonomy. It reports that a LightGBM text–tabular baseline reaches accuracy 0.963 / macro-F1 0.856 / ROC-AUC 0.974 on severity, that RecallRisk-BERT substantially outperforms single-task PubMedBERT in the multi-task setting, and that model-derived risk rankings correlate with observed root-cause severity patterns at Spearman rho = 0.983 (p = 1.936e-6).

Significance. If the reported metrics and correlation survive proper validation, the work offers a practical, scalable approach to post-report triage of medical-device recalls that could assist regulatory workflows. The multi-task text–tabular architecture and the large curated dataset constitute concrete contributions; the near-perfect rho value, if computed on held-out data, would be a notable empirical finding.

major comments (2)

[Abstract / Methods] Abstract and Methods: the manuscript reports concrete performance numbers and the rho = 0.983 correlation but supplies no description of train/test splits, hyperparameter search protocol, class-imbalance handling, or whether the correlation was evaluated on held-out data. These omissions are load-bearing for the central claim that RecallRisk-BERT “substantially outperformed” the single-task baseline and that the risk rankings are reliable.
[Data / Methods] Data and label construction: the consolidation of root-cause labels into exactly 9 classes is not specified, and no label-validation step, inter-rater reliability check, or noise analysis is described for the severity and root-cause annotations extracted from openFDA narratives. Because both the multi-task gains and the rho = 0.983 result rest on the assumption that these labels are sufficiently clean and unbiased, the absence of such verification undermines the interpretability of all quantitative claims.

minor comments (2)

[Abstract] The abstract states that the study period ends in “October 2025” but gives no exact cutoff date. The range is consistent with the June 2026 submission date, but the date of the last included record should be stated.
[Results] Table or figure captions should explicitly state whether metrics are macro-averaged and whether they are computed on the test partition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important gaps in methodological transparency that we will address in revision. Below we respond point by point.

Referee: [Abstract / Methods] Abstract and Methods: the manuscript reports concrete performance numbers and the rho = 0.983 correlation but supplies no description of train/test splits, hyperparameter search protocol, class-imbalance handling, or whether the correlation was evaluated on held-out data. These omissions are load-bearing for the central claim that RecallRisk-BERT “substantially outperformed” the single-task baseline and that the risk rankings are reliable.

Authors: We agree that these details are essential for reproducibility and for substantiating the performance claims. The initial submission omitted them primarily for brevity. In the revised manuscript we will add a new subsection in Methods that specifies: the stratified train/validation/test split (70/15/15) preserving class distributions for both tasks; the hyperparameter search protocol (Bayesian optimization with 5-fold cross-validation on the training portion); class-imbalance handling (inverse-frequency weighted cross-entropy loss); and explicit confirmation that all metrics, including the Spearman rho = 0.983, were computed on the held-out test set. These additions will directly support the reported outperformance and correlation results.

revision: yes

Referee: [Data / Methods] Data and label construction: the consolidation of root-cause labels into exactly 9 classes is not specified, and no label-validation step, inter-rater reliability check, or noise analysis is described for the severity and root-cause annotations extracted from openFDA narratives. Because both the multi-task gains and the rho = 0.983 result rest on the assumption that these labels are sufficiently clean and unbiased, the absence of such verification undermines the interpretability of all quantitative claims.

Authors: We accept that the label-consolidation procedure must be documented. The 9-class taxonomy was produced by grouping the original openFDA root-cause categories according to semantic and regulatory overlap; we will insert a supplementary table showing the exact mapping together with the rationale. Severity labels are the official FDA Class I/II/III designations and therefore do not require additional validation. For root-cause labels, no formal inter-rater reliability study was performed because the annotations originate from FDA reports. In revision we will add a dedicated paragraph describing potential label noise, report the pre-consolidation label distribution, and include a qualitative consistency check on a random sample of 200 records. We will also note this as a limitation in the Discussion.

revision: partial
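
The split and loss weighting named in the first response can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the simulated rebuttal gives only the 70/15/15 ratio and the phrase "inverse-frequency weighted cross-entropy", so the per-class rounding rule, the seed, and the `n / (k * count)` normalization are choices made here. The toy labels are invented.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split indices so each label keeps roughly the same share per part."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train, val, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to test
    return train, val, test


def inverse_frequency_weights(labels):
    """Per-class weight n / (k * count); rarer classes get larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


# Hypothetical severity labels for 200 records. To stratify on both tasks
# at once, pass (severity, root_cause) tuples as the labels instead.
labels = ["II"] * 160 + ["I"] * 24 + ["III"] * 16
train, val, test = stratified_split(labels)

# Weights are derived from the training portion only, so validation and
# test label frequencies do not influence the loss.
weights = inverse_frequency_weights([labels[i] for i in train])
print(len(train), len(val), len(test))
```

Each weight would multiply the per-example cross-entropy term for its class, so an error on a rare class costs more than an error on the majority class.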

Circularity Check

0 steps flagged

No circularity: this is an empirical ML evaluation on (implicitly) held-out data, with no derivations or self-referential predictions.

The paper describes training classical ML models and a multi-task BERT variant on 54,165 FDA records, then reports accuracy/F1/AUC on (implicitly held-out) test data plus a Spearman correlation against observed severity patterns. No equations, no fitted parameters renamed as predictions, no self-citation chains, and no ansatz or uniqueness claims appear in the provided text. All performance numbers are external to the model inputs by construction, satisfying the self-contained benchmark criterion.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

Only abstract available; ledger populated from stated elements. Model training involves many implicit free parameters. No invented entities. Standard ML assumptions apply.

free parameters (2)

BERT fine-tuning hyperparameters and classification head weights
Learned during training on the 54k records; exact values and search procedure not stated.
LightGBM hyperparameters
Boosting model parameters fitted to the same data for the single-task baseline.

axioms (2)

domain assumption: Recall narratives and categorical fields are sufficiently informative and unbiased for the prediction tasks
Central modeling premise stated implicitly by the choice to train on these features.
standard math: Standard supervised learning assumptions (i.i.d. samples, fixed label definitions)
Required for any supervised classification model on this dataset.

[1] [1]

Ahsan, K., & Gunawan, I. (2014). Analysis of product recalls: Identi- fication of recall initiators and causes of recall.Operations and Supply Chain Management: An International Journal, 7(3), 97–106

2014

[2] [2]

Barbosa Slivinskis, V., Agi Maluli, I., & Broder, J. S. (2025). A machine learning algorithm to predict medical device recall by the Food and Drug Administration.Western Journal of Emergency Medicine, 26(1), 161–170.https://doi.org/10.5811/westjem.21238

work page doi:10.5811/westjem.21238 2025

[3] [3]

Blom, T., & Niemann, W. (2022). Managing reputational risk during supply chain disruption recovery: A triadic logistics outsourcing per- spective.Journal of Transport and Supply Chain Management, 16, a623

2022

[4] [4]

B., Yen, Y.-J., Lian, J.-Y., Sing, M., & Chen, P.-T

Chen, W.-P., Teng, W.-G., Kuo, C. B., Yen, Y.-J., Lian, J.-Y., Sing, M., & Chen, P.-T. (2025). Regulatory insights from 27 years of artificial in- telligence/machine learning–enabled medical device recalls in the United States: Implications for future governance.JMIR Medical Informatics, 13, e67552.https://doi.org/10.2196/67552

work page doi:10.2196/67552 2025

[5] [5]

FDA. (2024). Recalls, corrections and removals (devices). U.S. Food and Drug Administration.https://www.fda. gov/medical-devices/postmarket-requirements-devices/ recalls-corrections-and-removals-devices

2024

[6] [6]

R., Takata, J., Ducey, A., Lehoux, P., Ross, S., Trbovich, P., Easty, A., Bell, C., & Urbach, D

Gagliardi, A. R., Takata, J., Ducey, A., Lehoux, P., Ross, S., Trbovich, P., Easty, A., Bell, C., & Urbach, D. (2017). Medical device recalls in Canada from 2005 to 2015.International Journal of Technology Assess- ment in Health Care, 33(6), 708–714

2017

[7] [7]

Hu, Y., Monticolo, D., & Ghadimi, P. (2025). A machine learning-based medical device recall initiator prediction framework: From supply chain risk management and resilience view.Expert Systems with Applications

2025

[8] [8]

Marucheck, A., Greis, N., Mena, C., & Cai, L. (2011). Product safety and security in the global supply chain: Issues, challenges and research opportunities.Journal of Operations Management, 29(7–8), 707–720

2011

[9] [9]

P., T., S

M.J., A. P., T., S. K., & R., K. (2024). A comprehensive analysis of Class I medical device recalls: Unveiling patterns, causes and global impacts. Cureus, 16(8), e67542.https://doi.org/10.7759/cureus.67542 25

work page doi:10.7759/cureus.67542 2024

[10] [10]

K., & Sinha, K

Mukherjee, U. K., & Sinha, K. K. (2018). Product recall decisions in medicaldevicesupplychains: Abigdataanalyticapproachtoevaluating judgment bias.Production and Operations Management, 27(10), 1790– 1816

2018

[11] [11]

Sarkissian, A. (2018). An exploratory analysis of U.S. FDA Class I med- ical device recalls: 2014–2018.Journal of Medical Engineering & Tech- nology, 42(8), 595–603

2018

[12] [12]

Taylor, N. P. (2023, January 26). FDA Class I medical device recalls hit five-year high in 2022.MedTech Dive

2023

[13] [13]

Thirumalai, S., & Sinha, K. K. (2011). Product recalls in the medical device industry: An empirical exploration of the sources and financial consequences.Management Science, 57(2), 376–392

2011

[14] [14]

L., Guerin, H

Villarraga, M. L., Guerin, H. L., & Lam, R. C. (2007). An analysis of FDA medical device recalls.Journal of Clinical Engineering, 32(2), 79–82

2007

[15] [15]

Zhang, D., & Shen, D. (2012). Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease.NeuroImage, 59(2), 895–907

2012

[16] [16]

G., Barnett, J., Kuljis, J., Craven, M

Money, A. G., Barnett, J., Kuljis, J., Craven, M. P., Martin, J. L., & Young, T. (2011). The role of the user within the medical device design and development process: medical device manufacturers’ perspectives. BMC Medical Informatics and Decision Making, 11, 1–12

2011

[17] [17]

U., & Kaminski, P

Ocampo, J. U., & Kaminski, P. C. (2019). Medical device development, from technical design to integrated product development.Journal of Medical Engineering & Technology, 43(5), 287–304

2019

[18] [18]

W., Seo, S

Park, C. W., Seo, S. W., Kang, N., Ko, B., Choi, B. W., Park, C. M., ... & Yoon, H. J. (2020). Artificial intelligence in health care: current applications and issues.Journal of Korean Medical Science, 35(42)

2020

[19] [19]

K., Wong, Y

Mak, K. K., Wong, Y. H., & Pichika, M. R. (2024). Artificial intelligence in drug discovery and development.Drug Discovery and Evaluation: Safety and Pharmacokinetic Assays, 1461–1498. 26

2024

[20] [20]

Briganti, G., & Le Moine, O. (2020). Artificial Intelligence in Medicine: Today and Tomorrow.Frontiers in Medicine, 7:27. doi:10.3389/fmed.2020.00027

work page doi:10.3389/fmed.2020.00027 2020

[21] Badnjević, A., Avdihodžić, H., & Gurbeta Pokvić, L. (2021). Artificial intelligence in medical devices: Past, present and future. Psychiatria Danubina, 33(suppl 3), 101–106.

[22] Muehlematter, U. J., Daniore, P., & Vokinger, K. N. (2021). Approval of artificial intelligence and machine learning-based medical devices in the USA and Europe (2015–20): a comparative analysis. The Lancet Digital Health, 3(3), e195–e203.

[23] Joshi, G., Jain, A., Araveeti, S. R., Adhikari, S., Garg, H., & Bhandari, M. (2024). FDA-approved artificial intelligence and machine learning (AI/ML)-enabled medical devices: an updated landscape. Electronics, 13(3), 498.

[24] Clusmann, J., Kolbinger, F. R., Muti, H. S., Carrero, Z. I., Eckardt, J. N., Laleh, N. G., ... & Kather, J. N. (2023). The future landscape of large language models in medicine. Communications Medicine, 3(1), 141.

[25] Chen, W., Teng, W., Kuo, C. B., Yen, Y., Lian, J., Sing, M., & Chen, P. (2025). Regulatory Insights From 27 Years of Artificial Intelligence/Machine Learning–Enabled Medical Device Recalls in the United States: Implications for Future Governance. JMIR Medical Informatics, 13(1), e67552.

[26] M J, A. P., Kumar, S., & Kamaraj, R. (2024). A comprehensive analysis of Class I medical device recalls: Unveiling patterns, causes and global impacts. Cureus, 16(8), e67542.

[27] Everhart, A. O., Sen, S., Stern, A. D., Zhu, Y., & Karaca-Mandic, P. (2023). Association between regulatory submission characteristics and recalls of medical devices receiving 510(k) clearance. Journal of the American Medical Association (JAMA), 329(2), 144–156.

[28] Zhalechian, M., Saghafian, S., & Robles, O. (2024). Harmonizing safety and speed: A human-algorithm approach to enhance the FDA’s medical device clearance policy. arXiv preprint. https://arxiv.org/abs/2407.11823

[29] Zhu, Y., Sen, S., Everhart, A., & Karaca-Mandic, P. (2025). A deep learning approach for predicting FDA’s 510(k) medical device recalls using device citation relationships. Information Systems Research.

[30] Hu, Y. (2024). In-depth analysis of recall initiators of medical devices with a Machine Learning–Natural Language Processing workflow. arXiv preprint arXiv:2406.10312.

[31] Sarica, S., & Luo, J. (2021). Stopwords in technical language processing. PLOS ONE, 16(8), e0254937.

[32] Balakrishnan, V., & Lloyd-Yemoh, E. (2014). Stemming and lemmatization: A comparison of retrieval performances. Lecture Notes on Software Engineering, 2(3), 262.

[33] Singh, K. N., Devi, S. D., Devi, H. M., & Mahanta, A. K. (2022). A novel approach for dimension reduction using word embedding: An enhanced text classification approach. International Journal of Information Management Data Insights, 2(1), 100061.

[34] Shi, Y., Yang, Y., & Liu, Y. (2018). Word embedding representation with synthetic position and context information for relation extraction. In 2018 IEEE International Conference on Big Knowledge (ICBK) (pp. 106–112). IEEE.

[35] Bouke, M. A., & Abdullah, A. (2023). An empirical study of pattern leakage impact during data preprocessing on machine learning-based intrusion detection models reliability. Expert Systems with Applications, 230, 120715.

[36] U.S. Government Accountability Office (GAO). (2025). Medical device recalls: HHS and FDA should address limitations in oversight of recall process (GAO-26-107619). https://www.gao.gov/products/gao-26-107619

[37] LaValley, M. P. (2008). Logistic regression. Circulation, 117(18), 2395–2399.

[38] Joachims, T. (2002). Support vector machines. In Learning to classify text using support vector machines (pp. 35–44). Boston, MA: Springer US.

[39] Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.

[40] Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD (pp. 785–794).

[41] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., ... & Liu, T. Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30.

[42] Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., & Gulin, A. (2018). CatBoost: Unbiased boosting with categorical features. Advances in Neural Information Processing Systems, 31, 6639–6649.

[43] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.

[44] Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681.

[45] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT (pp. 4171–4186).

[46] Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. ICLR.

[47] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

[48] Hu, Y., Monticolo, D., & Ghadimi, P. (2026). A machine learning-based medical device recall initiator prediction framework: From supply chain risk management and resilience view. Expert Systems with Applications, 298, 129922.

[49] Emakhu, J., Aguwa, C., Monplaisir, L., & Arslanturk, S. (2020). Failure type prediction in software-related medical device recalls. Wayne State University. In Proceedings of IISE Annual Conference (pp. 1–6).

[50] Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal Deep Learning. In Proceedings of ICML (pp. 689–696).

[51] Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama, N., Liu, X., Naumann, T., Gao, J., & Poon, H. (2021). Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare, 3(1), 1–23.

[52] Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2020). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234–1240.

[53] Li, T., Zhu, W., Xia, W., Wang, L., Li, W., & Zhang, P. (2024). Research on adverse event classification algorithm of da Vinci surgical robot based on Bert-BiLSTM model. Frontiers in Computational Neuroscience, 18, 1476164.

[54] Luschi, A., Nesi, P., & Iadanza, E. (2023). Evidence-based clinical engineering: Health information technology adverse events identification and classification with natural language processing. Heliyon, 9(11), e21723.

[55] Deznabi, I., Iyyer, M., & Fiterau, M. (2021). Predicting in-hospital mortality by combining clinical notes with time-series data. In Proceedings of ACL-IJCNLP (pp. 4026–4031).

[56] Huang, K., Altosaar, J., & Ranganath, R. (2019). ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. arXiv preprint. https://arxiv.org/abs/1904.05342

[57] Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513–523.

[58] Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of EMNLP (pp. 1532–1543).

[59] Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681.