pith. machine review for the scientific record.

arxiv: 2605.11502 · v1 · submitted 2026-05-12 · 💻 cs.CL

Recognition: 1 theorem link

· Lean Theorem

Robust Biomedical Publication Type and Study Design Classification with Knowledge-Guided Perturbations

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 02:18 UTC · model grok-4.3

classification 💻 cs.CL
keywords biomedical literature · publication type classification · study design indexing · robustness · semantic perturbations · entity masking · domain-adversarial training

The pith

Controlled semantic perturbations combined with selective training let biomedical classifiers gain robustness without losing in-domain accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an evaluation framework that applies controlled semantic perturbations to biomedical abstracts, removing superficial topical cues while keeping the underlying publication type and study design labels intact. It then compares standard training against approaches that combine entity masking with domain-adversarial training to discourage reliance on non-methodological signals. Results indicate that these targeted robustness objectives reduce dependence on spurious domain-specific features and increase use of explicit methodological descriptions. This approach matters because accurate and consistent indexing of publication types supports evidence synthesis and reliable knowledge discovery in biomedicine.
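The masking side of this recipe can be made concrete with a toy sketch. This is an illustration only, not the paper's pipeline: the entity list below is an invented stand-in for what the paper derives from biomedical NER and knowledge resources.

```python
import re

# Hypothetical topical-entity list; the paper identifies such entities with
# biomedical NER, not with hand-written vocabularies.
NON_METHOD_ENTITIES = ["metformin", "type 2 diabetes"]

def mask_topical_entities(text: str) -> str:
    """Replace topical (non-methodological) entity mentions with [MASK],
    leaving explicit study-design phrases such as 'randomized controlled
    trial' untouched, so the label-defining signal survives."""
    out = text
    for ent in NON_METHOD_ENTITIES:
        out = re.sub(re.escape(ent), "[MASK]", out, flags=re.IGNORECASE)
    return out

abstract = ("We conducted a randomized controlled trial of metformin "
            "in patients with type 2 diabetes.")
masked = mask_topical_entities(abstract)
print(masked)
# -> We conducted a randomized controlled trial of [MASK] in patients with [MASK].
```

The methodological cue stays intact while the topical cues disappear, which is exactly the property the perturbation framework depends on.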

Core claim

We introduce an evaluation framework based on controlled semantic perturbations to assess the robustness of a publication type classifier and investigate robustness-oriented training strategies that combine entity masking and domain-adversarial training to mitigate reliance on spurious topical correlations. Our results show that the commonly observed trade-off between robustness and in-domain accuracy can be mitigated when robustness objectives are designed to selectively suppress non-task-defining features while preserving salient methodological signals. We find that these improvements arise from two complementary mechanisms: (1) increased reliance on explicit methodological cues when such cues are present in the input, and (2) reduced reliance on spurious domain-specific topical features.

What carries the argument

The combination of entity masking and domain-adversarial training applied under knowledge-guided perturbations, which selectively suppresses non-task-defining topical features while retaining methodological signals.
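The domain-adversarial half of that combination follows the gradient-reversal scheme of Ganin et al. The sketch below is a minimal numerical illustration of the reversal step only, with an assumed coefficient `lam`; the paper's actual architecture sits on a SPECTER2-base encoder.

```python
def grl_forward(features):
    """Gradient reversal layer: plain identity in the forward pass."""
    return features

def grl_backward(domain_grad, lam):
    """Backward pass: negate and scale the domain head's gradient, so the
    shared encoder is updated to REMOVE domain-predictive (topical) signal."""
    return [-lam * g for g in domain_grad]

# The encoder's update direction descends on the task loss while ascending
# on the domain-discrimination loss.
task_grad = [0.2, -0.1]
domain_grad = [0.5, 0.5]
encoder_grad = [t + r for t, r in
                zip(task_grad, grl_backward(domain_grad, lam=0.1))]
# encoder_grad ≈ [0.15, -0.15]
```

The small `lam` keeps the adversarial pressure from overwhelming the task gradient, which is one way the selective suppression described above can be tuned.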

If this is right

  • Classifiers maintain high in-domain accuracy while showing improved performance under distributional shifts.
  • Models increase their reliance on explicit methodological cues such as descriptions of study design.
  • Models decrease their reliance on spurious domain-specific topical features.
  • Further refinement of masking and adversarial objectives to more selectively suppress topical information may yield additional robustness gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same perturbation-plus-adversarial strategy could be tested on other scientific text classification tasks that must separate methodology from topic.
  • Large-scale application to full-text biomedical articles rather than abstracts might reveal whether the gains scale beyond short documents.
  • If integrated into literature databases, the resulting classifiers could improve the precision of systematic review search filters.

Load-bearing premise

The controlled semantic perturbations preserve the true publication type and study design labels while only removing superficial or spurious cues.

What would settle it

A new test set in which methodological signals such as explicit study design phrases are also masked or altered; if accuracy on this set shows no gain or drops below the baseline model, the claim that the method preserves salient signals would be falsified.
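Degradation of the kind such a test would measure (change in micro-F1 relative to clean performance) reduces to simple counting. A self-contained sketch with toy labels, not the paper's PT taxonomy:

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 for multi-label outputs: pool true positives,
    false positives, and false negatives across all instances."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

gold = [{"RCT"}, {"Cohort Study"}, {"RCT", "Protocol"}]
clean_pred = [{"RCT"}, {"Cohort Study"}, {"RCT"}]
perturbed_pred = [{"RCT"}, set(), {"Cohort Study"}]

delta = micro_f1(gold, perturbed_pred) - micro_f1(gold, clean_pred)
# A negative delta is the degradation the paper plots against
# perturbation magnitude.
```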

Figures

Figures reproduced from arXiv: 2605.11502 by Halil Kilicoglu, Joe D. Menke, Neil R. Smalheiser, Shufan Ming.

Figure 1. Saliency visualization of the baseline model for PMID 30127137.
Figure 2. Overview of the adversarial training architecture: a pre-trained encoder (SPECTER2-base) produces shared representations that are fed into a PT …
Figure 3. Changes in micro-F1 (∆) under increasing levels of semantic perturbation, measured relative to clean test performance. Solid and dashed lines indicate synonym and concept substitutions, respectively; the final endpoint (100% concept substitution + EDA) corresponds to the degradation values reported in Table III. Across all models, performance degrades as perturbation magnitude increases …
Figure 4. Label-wise performance degradation (∆F1) under extreme semantic perturbation for core PTs. Degradation … is substantially reduced by masked-entity training (∆F1 = 0.101) and the MASK + ADVERSARIAL strategy (∆F1 = 0.114); a similar trend holds for Case-Control Studies, where the baseline drop (∆F1 = 0.102) exceeds the MASK model's (∆F1 = 0.070).
Figure 5. Example illustrating a missed prediction.
Figure 6. Example illustrating a corrected prediction.
Figure 7. Example illustrating a correct prediction.
Original abstract

Accurately and consistently indexing biomedical literature by publication type and study design is essential for supporting evidence synthesis and knowledge discovery. Prior work on automated publication type and study design indexing has primarily focused on expanding label coverage, enriching feature representations, and improving in-domain accuracy, with evaluation typically conducted on data drawn from the same distribution as training. Although pretrained biomedical language models achieve strong performance under these settings, models optimized for in-domain accuracy may rely on superficial lexical or dataset-specific cues, resulting in reduced robustness under distributional shift. In this study, we introduce an evaluation framework based on controlled semantic perturbations to assess the robustness of a publication type classifier and investigate robustness-oriented training strategies that combine entity masking and domain-adversarial training to mitigate reliance on spurious topical correlations. Our results show that the commonly observed trade-off between robustness and in-domain accuracy can be mitigated when robustness objectives are designed to selectively suppress non-task-defining features while preserving salient methodological signals. We find that these improvements arise from two complementary mechanisms: (1) increased reliance on explicit methodological cues when such cues are present in the input, and (2) reduced reliance on spurious domain-specific topical features. These findings highlight the importance of feature-level robustness analysis for publication type and study design classification and suggest that refining masking and adversarial objectives to more selectively suppress topical information may further improve robustness. Data, code, and models are available at: https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/ICHI

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes an evaluation framework based on controlled semantic perturbations to assess robustness in classifying biomedical publication types and study designs. It combines entity masking with domain-adversarial training to mitigate the trade-off between robustness and in-domain accuracy by selectively suppressing non-task-defining topical features while preserving methodological signals. The authors claim this leads to increased reliance on explicit methodological cues and reduced reliance on spurious domain-specific features.

Significance. If substantiated, this work would be significant for biomedical NLP applications in evidence synthesis, as robust classification under distributional shifts is crucial for reliable knowledge discovery. The approach of designing robustness objectives to target specific feature types rather than blanket regularization is promising, and the public availability of data, code, and models supports reproducibility.

major comments (2)
  1. [Evaluation framework] The central claim that the trade-off between robustness and in-domain accuracy can be mitigated requires that controlled semantic perturbations (entity masking + domain-adversarial training) preserve the true publication type and study design labels while only removing superficial cues. No label-preservation checks, human validation, or quantitative verification that perturbed inputs retain original annotations (e.g., that phrases like 'randomized controlled trial' are not altered) are described; this assumption is load-bearing for interpreting robustness gains as selective feature suppression rather than label corruption.
  2. [Results] The abstract asserts that improvements arise from two mechanisms—increased reliance on methodological cues when present and reduced reliance on spurious topical features—but provides no quantitative metrics, baseline comparisons, error analysis, or details on how these mechanisms were measured (e.g., via ablation, feature attribution, or perturbation-specific performance breakdowns). Without these, the magnitude of mitigation and attribution to the proposed strategies cannot be verified.
minor comments (1)
  1. The abstract states high-level results without specific performance numbers, effect sizes, or dataset details, which would aid assessment of practical impact.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive feedback. We address each of the major comments below and describe the revisions we plan to make.

Point-by-point responses
  1. Referee: [Evaluation framework] The central claim that the trade-off between robustness and in-domain accuracy can be mitigated requires that controlled semantic perturbations (entity masking + domain-adversarial training) preserve the true publication type and study design labels while only removing superficial cues. No label-preservation checks, human validation, or quantitative verification that perturbed inputs retain original annotations (e.g., that phrases like 'randomized controlled trial' are not altered) are described; this assumption is load-bearing for interpreting robustness gains as selective feature suppression rather than label corruption.

    Authors: We agree that explicit verification of label preservation under our controlled semantic perturbations is crucial for validating the interpretation of our results. Although the perturbations are designed using knowledge-guided entity masking to target only non-methodological entities (identified via biomedical NER) while preserving phrases indicative of publication type and study design, the manuscript does not describe formal checks. In the revision, we will add a dedicated analysis section that includes: (1) quantitative metrics measuring the retention rate of key methodological terms across the test set, (2) a human evaluation on a random sample of 100 perturbed instances where annotators verify label consistency, and (3) comparison of model predictions on original vs. perturbed inputs to confirm no systematic label corruption. This will strengthen the claim that robustness gains stem from selective feature suppression. revision: yes

  2. Referee: [Results] The abstract asserts that improvements arise from two mechanisms—increased reliance on methodological cues when present and reduced reliance on spurious topical features—but provides no quantitative metrics, baseline comparisons, error analysis, or details on how these mechanisms were measured (e.g., via ablation, feature attribution, or perturbation-specific performance breakdowns). Without these, the magnitude of mitigation and attribution to the proposed strategies cannot be verified.

    Authors: The full manuscript includes ablation studies comparing our combined entity masking and domain-adversarial approach against baselines without these components, as well as performance breakdowns on perturbed vs. original data. However, we acknowledge that the abstract and results section could more explicitly quantify the two mechanisms. In the revision, we will add: detailed feature attribution results (e.g., using attention visualization or gradient-based methods to demonstrate increased focus on methodological cues post-training), error analysis categorizing failure cases related to topical vs. methodological features, and perturbation-specific performance tables showing robustness gains without accuracy loss. These additions will provide the requested quantitative support and clarify how the mechanisms were measured. revision: partial
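The label-preservation check proposed in response 1 reduces to a retention-rate computation over methodological terms. A hypothetical sketch (the term list and the example texts are invented for illustration):

```python
METHOD_TERMS = ["randomized", "double-blind", "case-control", "cohort"]

def retention_rate(original: str, perturbed: str, terms=METHOD_TERMS) -> float:
    """Fraction of methodological terms present in the original abstract
    that survive perturbation -- a proxy for label preservation."""
    present = [t for t in terms if t in original.lower()]
    if not present:
        return 1.0  # nothing label-defining to preserve
    kept = sum(1 for t in present if t in perturbed.lower())
    return kept / len(present)

orig = "A double-blind randomized trial in an elderly cohort."
pert = "A double-blind randomized trial in an [MASK] cohort."
rate = retention_rate(orig, pert)
print(rate)  # -> 1.0
```

A rate well below 1.0 on a perturbed test set would signal label corruption rather than selective feature suppression.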

Circularity Check

0 steps flagged

No significant circularity; evaluation framework and objectives defined independently of reported metrics

full rationale

The paper defines its evaluation framework via controlled semantic perturbations (entity masking + domain-adversarial training) and reports empirical results on robustness vs. accuracy trade-offs. These components are introduced as external techniques applied to the classification task, with no equations or claims reducing the final performance metrics back to the perturbations by construction. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described chain. The central claim rests on observed improvements from two mechanisms, supported by standard ML training rather than tautological redefinition of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are explicitly introduced or fitted in the abstract; the approach relies on standard assumptions from NLP robustness literature such as the separability of task-defining and spurious features.

pith-pipeline@v0.9.0 · 5582 in / 1190 out tokens · 46195 ms · 2026-05-13T02:18:11.059358+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel — unclear

    Relation between the paper passage and the cited Recognition theorem:

    “Our results show that the commonly observed trade-off between robustness and in-domain accuracy can be mitigated when robustness objectives are designed to selectively suppress non-task-defining features while preserving salient methodological signals.”

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

  1. [1]

    Incorporating Values for Indexing Method in MEDLINE/PubMed XML,

    National Library of Medicine (US), “Incorporating Values for Indexing Method in MEDLINE/PubMed XML,” NLM Tech Bulletin, vol. e2, p. 423, 2018, accessed 02.26.2024. [Online]. Available: nlm.nih.gov/pubs/techbull/ja18/ja18_indexing_method.html

  2. [2]

    The NLM Indexing Initiative’s Medical Text Indexer,

    A. R. Aronson, J. G. Mork, C. W. Gay, S. M. Humphrey, and W. J. Rogers, “The NLM Indexing Initiative’s Medical Text Indexer,” in MEDINFO 2004. IOS Press, 2004, pp. 268–272

  3. [3]

    The road from manual to automatic semantic indexing of biomedical literature: a 10 years journey,

    A. Krithara, J. G. Mork, A. Nentidis, and G. Paliouras, “The road from manual to automatic semantic indexing of biomedical literature: a 10 years journey,” Frontiers in Research Metrics and Analytics, vol. 8, p. 1250930, 2023

  4. [4]

    Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing,

    Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon, “Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing,” ACM Transactions on Computing for Healthcare (HEALTH), vol. 3, no. 1, pp. 1–23, 2021

  5. [5]

    BERTMeSH: deep contextual representation learning for large-scale high-performance MeSH indexing with full text,

    R. You, Y. Liu, H. Mamitsuka, and S. Zhu, “BERTMeSH: deep contextual representation learning for large-scale high-performance MeSH indexing with full text,” Bioinformatics, vol. 37, no. 5, pp. 684–692, 2021

  6. [6]

    Evidence-based medicine, the essential role of systematic reviews, and the need for automated text mining tools,

    A. M. Cohen, C. E. Adams, J. M. Davis, C. Yu, P. S. Yu, W. Meng, L. Duggan, M. McDonagh, and N. R. Smalheiser, “Evidence-based medicine, the essential role of systematic reviews, and the need for automated text mining tools,” in Proceedings of the 1st ACM International Health Informatics Symposium, 2010, pp. 376–380

  7. [7]

    Evaluation of publication type tagging as a strategy to screen randomized controlled trial articles in preparing systematic reviews,

    J. Schneider, L. Hoang, Y. Kansara, A. M. Cohen, and N. R. Smalheiser, “Evaluation of publication type tagging as a strategy to screen randomized controlled trial articles in preparing systematic reviews,” JAMIA Open, vol. 5, no. 1, p. ooac015, 2022

  8. [8]

    Automated confidence ranked classification of randomized controlled trial articles: an aid to evidence-based medicine,

    A. M. Cohen, N. R. Smalheiser, M. S. McDonagh, C. Yu, C. E. Adams, J. M. Davis, and P. S. Yu, “Automated confidence ranked classification of randomized controlled trial articles: an aid to evidence-based medicine,” Journal of the American Medical Informatics Association, vol. 22, no. 3, pp. 707–717, 2015

  9. [9]

    Evidence based medicine: what it is and what it isn’t,

    D. L. Sackett, W. M. Rosenberg, J. M. Gray, R. B. Haynes, and W. S. Richardson, “Evidence based medicine: what it is and what it isn’t,” pp. 71–72, 1996

  10. [10]

    Fifty Ways to Tag your Pubtypes: Multi-Tagger, a Set of Probabilistic Publication Type and Study Design Taggers to Support Biomedical Indexing and Evidence-Based Medicine,

    A. M. Cohen, J. Schneider, Y. Fu, M. S. McDonagh, P. Das, A. W. Holt, and N. R. Smalheiser, “Fifty Ways to Tag your Pubtypes: Multi-Tagger, a Set of Probabilistic Publication Type and Study Design Taggers to Support Biomedical Indexing and Evidence-Based Medicine,” medRxiv, 2021

  11. [11]

    Publication Type Tagging using Transformer Models and Multi-Label Classification,

    J. D. Menke, H. Kilicoglu, and N. R. Smalheiser, “Publication Type Tagging using Transformer Models and Multi-Label Classification,” in AMIA Annual Symposium Proceedings, vol. 2024, 2025, p. 818

  12. [12]

    Enhancing automated indexing of publication types and study designs in biomedical literature using full-text features,

    J. D. Menke, S. Ming, S. Radhakrishna, H. Kilicoglu, and N. R. Smalheiser, “Enhancing automated indexing of publication types and study designs in biomedical literature using full-text features,” medRxiv, 2025

  13. [13]

    SciRepEval: A Multi-Format Benchmark for Scientific Document Representations,

    A. Singh, M. D’Arcy, A. Cohan, D. Downey, and S. Feldman, “SciRepEval: A Multi-Format Benchmark for Scientific Document Representations,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 5548–5566. [Online]. ...

  14. [14]

    Beyond Distribution Shift: Spurious Features Through the Lens of Training Dynamics,

    N. Murali, A. Puli, K. Yu, R. Ranganath, and K. Batmanghelich, “Beyond Distribution Shift: Spurious Features Through the Lens of Training Dynamics,” Transactions on Machine Learning Research, 2023

  15. [15]

    The role of natural language processing in improving cancer care: A scoping review with narrative synthesis,

    M. Sun, E. Reiter, L. Duncan, and R. Adam, “The role of natural language processing in improving cancer care: A scoping review with narrative synthesis,” Artificial Intelligence in Medicine, p. 103227, 2025

  16. [16]

    Measure and Improve Robustness in NLP Models: A Survey,

    X. Wang, H. Wang, and D. Yang, “Measure and Improve Robustness in NLP Models: A Survey,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 4569–4586

  17. [17]

    The elephant in the interpretability room: Why use attention as explanation when we have saliency methods?

    J. Bastings and K. Filippova, “The elephant in the interpretability room: Why use attention as explanation when we have saliency methods?” in Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2020, pp. 149–155

  18. [18]

    Captum: A unified and generic model interpretability library for PyTorch,

    N. Kokhlikyan, V. Miglani, M. Martin, E. Wang, B. Alsallakh, J. Reynolds, A. Melnikov, N. Kliushkina, C. Araya, S. Yan, and O. Reblitz-Richardson, “Captum: A unified and generic model interpretability library for PyTorch,” 2020

  19. [19]

    Improving the robustness and accuracy of biomedical language models through adversarial training,

    M. Moradi and M. Samwald, “Improving the robustness and accuracy of biomedical language models through adversarial training,” Journal of Biomedical Informatics, vol. 132, p. 104114, 2022

  20. [20]

    On Adversarial Examples for Biomedical NLP Tasks,

    V. Araujo, A. Carvallo, C. Aspillaga, and D. Parra, “On Adversarial Examples for Biomedical NLP Tasks,” arXiv preprint arXiv:2004.11157, 2020

  21. [21]

    Adversarial Training for Large Neural Language Models,

    X. Liu, H. Cheng, P. He, W. Chen, Y. Wang, H. Poon, and J. Gao, “Adversarial Training for Large Neural Language Models,” arXiv preprint arXiv:2004.08994, 2020

  22. [22]

    A Survey of Adversarial Defenses and Robustness in NLP,

    S. Goyal, S. Doddapaneni, M. M. Khapra, and B. Ravindran, “A Survey of Adversarial Defenses and Robustness in NLP,” ACM Computing Surveys, vol. 55, no. 14s, pp. 1–39, 2023

  23. [23]

    Boosting low-resource biomedical qa via entity-aware masking strategies,

    G. Pergola, E. Kochkina, L. Gui, M. Liakata, and Y. He, “Boosting low-resource biomedical qa via entity-aware masking strategies,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, P. Merlo, J. Tiedemann, and R. Tsarfaty, Eds. Online: Association for Computational Linguistics, Apr....

  24. [24]

    COVID-QA: A Question Answering Dataset for COVID-19,

    T. Möller, A. Reina, R. Jayakumar, and M. Pietsch, “COVID-QA: A Question Answering Dataset for COVID-19,” in Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, 2020

  25. [25]

    BioASQ-QA: A manually curated corpus for Biomedical Question Answering,

    A. Krithara, A. Nentidis, K. Bougiatiotis, and G. Paliouras, “BioASQ-QA: A manually curated corpus for Biomedical Question Answering,” Scientific Data, vol. 10, no. 1, p. 170, 2023

  26. [26]

    Unsupervised Domain Adaptation by Backpropagation,

    Y. Ganin and V. Lempitsky, “Unsupervised Domain Adaptation by Backpropagation,” in International Conference on Machine Learning. PMLR, 2015, pp. 1180–1189

  27. [27]

    Measuring the Robustness of NLP Models to Domain Shifts,

    N. Calderon, N. Porat, E. Ben-David, A. Chapanin, Z. Gekhman, N. Oved, V. Shalumov, and R. Reichart, “Measuring the Robustness of NLP Models to Domain Shifts,” in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 126–154

  28. [28]

    Robustness to spurious correlations in text classification via automatically generated counterfactuals,

    Z. Wang and A. Culotta, “Robustness to spurious correlations in text classification via automatically generated counterfactuals,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 16, 2021, pp. 14024–14031

  29. [29]

    Asymmetric Loss for Multi-Label Classification,

    T. Ridnik, E. Ben-Baruch, N. Zamir, A. Noy, I. Friedman, M. Protter, and L. Zelnik-Manor, “Asymmetric Loss for Multi-Label Classification,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 82–91

  30. [30]

    When Does Label Smoothing Help?

    R. Müller, S. Kornblith, and G. E. Hinton, “When Does Label Smoothing Help?” Advances in Neural Information Processing Systems, vol. 32, 2019

  31. [31]

    Contrastive Learning with Complex Heterogeneity,

    L. Zheng, J. Xiong, Y. Zhu, and J. He, “Contrastive Learning with Complex Heterogeneity,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 2594–2604

  32. [32]

    Understanding Contrastive Learning via Distributionally Robust Optimization,

    J. Wu, J. Chen, J. Wu, W. Shi, X. Wang, and X. He, “Understanding Contrastive Learning via Distributionally Robust Optimization,” Advances in Neural Information Processing Systems, vol. 36, 2024

  33. [33]

    The Unified Medical Language System (UMLS): integrating biomedical terminology,

    O. Bodenreider, “The Unified Medical Language System (UMLS): integrating biomedical terminology,” Nucleic Acids Research, vol. 32, no. suppl_1, pp. D267–D270, 2004

  34. [34]

    An overview of MetaMap: historical perspective and recent advances,

    A. R. Aronson and F.-M. Lang, “An overview of MetaMap: historical perspective and recent advances,” Journal of the American Medical Informatics Association, vol. 17, no. 3, pp. 229–236, 2010

  35. [35]

    EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks,

    J. Wei and K. Zou, “EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 6382–6388

  36. [36]

    Domain-Adversarial Training of Neural Networks,

    Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-Adversarial Training of Neural Networks,” Journal of Machine Learning Research, vol. 17, no. 59, pp. 1–35, 2016

  37. [37]

    Obtaining Well Calibrated Probabilities Using Bayesian Binning,

    M. P. Naeini, G. Cooper, and M. Hauskrecht, “Obtaining Well Calibrated Probabilities Using Bayesian Binning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 29, no. 1, 2015

  38. [38]

    Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics,

    P. Kumar and S. Mishra, “Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics,” arXiv preprint arXiv:2505.18658, 2025

  39. [39]

    TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP,

    J. X. Morris, E. Lifland, J. Y. Yoo, J. Grigsby, D. Jin, and Y. Qi, “TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP,” EMNLP 2020, p. 119, 2020

  40. [40]

    Salient information preserving adversarial training improves clean and robust accuracy,

    T. Redgrave and A. Czajka, “Salient information preserving adversarial training improves clean and robust accuracy,” arXiv preprint arXiv:2501.09086, 2025

  41. [41]

    Adversarial examples are not bugs, they are features,

    A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Mądry, “Adversarial examples are not bugs, they are features,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019, pp. 125–136

  42. [42]

    SPIRIT-CONSORT-TM: a corpus for assessing transparency of clinical trial protocol and results publications,

    L. Jiang, C. J. Vorland, X. Ying, A. W. Brown, J. D. Menke, G. Hong, M. Lan, E. Mayo-Wilson, and H. Kilicoglu, “SPIRIT-CONSORT-TM: a corpus for assessing transparency of clinical trial protocol and results publications,” Scientific Data, vol. 12, no. 1, p. 355, 2025