Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages

David Rey-Blanco; Roberto Cruz

arxiv: 2605.30529 · v1 · pith:F3A36Y6Mnew · submitted 2026-05-28 · cs.CL · cs.AI · cs.LG

Pith reviewed 2026-06-29 07:24 UTC · model grok-4.3

keywords: clinical coding · semantic search · multilingual retrieval · synthetic data · bi-encoder · cross-encoder · ICD-10 · non-English languages

The pith

A bi-encoder fine-tuned on synthetic clinical data matches BioBERT-ST on retrieval for ICD codes across five languages without English biomedical pretraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether sentence-embedding models developed mainly on English can be replaced for clinical code retrieval in other languages by fine-tuning a Spanish biomedical encoder on synthetic query-code pairs. It builds a two-stage system that first uses a bi-encoder to surface candidates and then applies a cross-encoder to rerank them. The approach reaches or exceeds the English model's scores on most languages and metrics, showing that targeted synthetic data can close the performance gap for non-English clinical search.

Core claim

Fine-tuning the Spanish biomedical encoder on approximately 19,500 Gemini-generated synthetic pairs for six languages produces a bi-encoder that attains MRR 0.876 (versus BioBERT-ST 0.866), R@3 0.650 (versus 0.626) and R@5 0.804 (versus 0.790). The added cross-encoder reranker raises aggregate R@5 to 0.822 and improves four of the five non-English languages, with Portuguese reaching 0.829 versus BioBERT-ST's 0.714.
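The metrics quoted above can be made concrete with a small sketch. It assumes that each query may have more than one relevant code, that MRR uses the rank of the first relevant code, and that R@k is the fraction of relevant codes found in the top k; the page does not state the paper's exact definitions, and the toy queries and codes below are hypothetical.

```python
from typing import Sequence, Set


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant code, or 0.0 if none is retrieved."""
    for position, code in enumerate(ranked, start=1):
        if code in relevant:
            return 1.0 / position
    return 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant codes that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


# Hypothetical toy run: (ranked codes returned by a retriever, relevant codes).
runs = [
    (["E11.9", "E10.9", "I10"], {"E11.9"}),             # hit at rank 1
    (["J45.909", "J18.9", "J44.9"], {"J18.9", "J44.9"}),  # first hit at rank 2
    (["I10", "I11.9", "N18.3"], {"K21.9"}),             # relevant code missing
]

mrr = mean([reciprocal_rank(r, rel) for r, rel in runs])
r_at_2 = mean([recall_at_k(r, rel, 2) for r, rel in runs])
r_at_3 = mean([recall_at_k(r, rel, 3) for r, rel in runs])
print(f"MRR={mrr:.3f}  R@2={r_at_2:.3f}  R@3={r_at_3:.3f}")
```

Under these definitions MRR can exceed R@3, as it does in the reported figures (0.876 versus 0.650), because MRR rewards the first hit while recall counts every relevant code.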

What carries the argument

Two-stage retriever of bi-encoder followed by cross-encoder reranker, fine-tuned on LLM-generated synthetic clinical query-code pairs.
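The retrieve-then-rerank pattern can be sketched as follows. This is only an illustration of the control flow: `embed` and `pair_score` are hypothetical stand-ins (bag-of-words cosine and token Jaccard overlap) for the fine-tuned bi-encoder and cross-encoder, and the four-code catalogue is invented for the example.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Stand-in for a bi-encoder: each text is embedded on its own."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pair_score(query: str, description: str) -> float:
    """Stand-in for a cross-encoder: scores the (query, code) pair jointly."""
    q, d = set(query.lower().split()), set(description.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


def two_stage_search(query: str, catalogue: Dict[str, str],
                     k_candidates: int = 3, k_final: int = 2) -> List[str]:
    q_vec = embed(query)
    # Stage 1: cheap similarity over the whole catalogue (code vectors could be cached).
    candidates = sorted(catalogue,
                        key=lambda code: cosine(q_vec, embed(catalogue[code])),
                        reverse=True)[:k_candidates]
    # Stage 2: the more expensive pairwise scorer reorders only the shortlist.
    reranked = sorted(candidates,
                      key=lambda code: pair_score(query, catalogue[code]),
                      reverse=True)
    return reranked[:k_final]


# Hypothetical mini-catalogue of ICD-10 style codes and descriptions.
catalogue = {
    "E11.9": "type 2 diabetes mellitus without complications",
    "E10.9": "type 1 diabetes mellitus without complications",
    "I10": "essential primary hypertension",
    "J18.9": "pneumonia unspecified organism",
}
result = two_stage_search("diabetes mellitus type 2", catalogue)
print(result)
```

The design point is that the first stage bounds how many pairs the second stage must score, so the reranker can only reorder what the bi-encoder has already surfaced.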

If this is right

The bi-encoder alone matches or exceeds the English baseline on aggregate metrics without English pretraining.
The cross-encoder reranker produces additional gains that concentrate in Catalan, Portuguese, Spanish and French.
An open recipe exists for constructing domain-specific medical retrievers from generated data alone.
The learning gain from the synthetic pairs lifts MRR from 0.755 to 0.876.
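As a quick arithmetic check on the learning gain, the sketch below recomputes the absolute and relative lift from the rounded MRR values quoted on this page. The helper name `relative_gain` is an illustrative choice, not something taken from the paper.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`."""
    return (after - before) / before


base_mrr, tuned_mrr = 0.755, 0.876  # rounded values quoted on this page
abs_gain = tuned_mrr - base_mrr
rel_gain = relative_gain(base_mrr, tuned_mrr)
print(f"absolute: {abs_gain:+.3f}, relative: {rel_gain:+.1%}")
```

From the rounded figures the relative lift comes out at about 16.0%, slightly above the +15.9% stated in the abstract; the difference is presumably due to the abstract using unrounded scores.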

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same synthetic-data pipeline could be tested on other medical terminologies such as SNOMED CT.
Mixing a small amount of real English data during fine-tuning might eliminate the minor English regression.
Deployment in hospitals would still require separate validation on authentic query logs.

Load-bearing premise

The synthetic queries and code descriptions produced by the large language model match the distribution and linguistic features of real clinical text in each target language.

What would settle it

Evaluating the fine-tuned retriever on a collection of genuine hospital queries in Spanish or Portuguese and finding that R@5 falls below BioBERT-ST's level.

Figures

Figures reproduced from arXiv: 2605.30529 by David Rey-Blanco, Roberto Cruz.

**Figure 1.** Two-stage retriever: a multilingual bi-encoder feeds a Spanish-tuned cross-encoder reranker. The asymmetry between the two stages is the central object of this study. Source: internal.

Section 2.2, Synthetic data generation for training: We use Google DeepMind Gemini 2.5 Flash Pro [33] as a data factory to generate two dataset families grounded on the ICD-10 chapter hierarchy: 1. Dataset A (bi-encoder training). This…

**Figure 2.** CodiESP: F1 and MAP@10 by model, at the exact-code and three-character category levels. MAP exceeds F1 for every model, which means that for the queries where top-1 is wrong, the correct code is still often present in the top-10 (the ranking signal is there, the calibration is what is missing). Three observations follow directly. First, the TietAI cross-encoder is best on every metric and at every resolut…

**Figure 3.** CodiESP: Top-1 precision, recall and F1 at the exact-code level. The TietAI cross-encoder dominates every metric; the precision–recall gap visible for BM25 is the only place these three metrics diverge.

**Figure 4.** CodiESP: Precision@k by retrieval model. Precision decays with k because each query typically has one gold code, but the relative ordering of methods is preserved at every k.

**Figure 5.** CodiESP: Recall@k by retrieval model. The TietAI cross-encoder reaches R@10 ≈ 0.81 exact / 0.90 category; the bi-encoder closes most of the gap by k = 10, confirming that the bi-encoder's recall is already strong and what the cross-encoder supplies is the top-1 ordering.

**Figure 6.** DISTEMIST: F1 and MAP@10 by model. The relative ordering of cross-encoder > bi-encoder > BM25 holds at every metric and at both resolutions, replicating the CodiESP pattern on a corpus the cross-encoder has not been trained on. As on CodiESP, MAP is uniformly higher than F1, by a wider margin for the bi-encoder (+0.079…

Original abstract

Sentence-embedding models for semantic search are overwhelmingly developed and evaluated on English corpora. When applied to clinical retrieval in other languages -- particularly retrieval of ICD-10-CM / CIE-10 codes -- recall degrades in ways often masked by aggregate benchmarks. We study whether large generative language models can serve as data factories to close this gap. We build a two-stage retriever (bi-encoder followed by cross-encoder reranker), fine-tuned from a Spanish biomedical encoder (PlanTL-GOB-ES/bsc-bio-ehr-es) on Gemini-generated synthetic data covering English, Spanish, Catalan, Italian, Portuguese and French, and evaluate against BioBERT-ST and the un-tuned Spanish encoder. The bi-encoder alone matches BioBERT-ST on MRR (0.876 vs. 0.866) and overtakes it on R@3 (0.650 vs. 0.626) and R@5 (0.804 vs. 0.790) without English biomedical pretraining. Adding a cross-encoder reranker lifts aggregate R@5 to 0.822 and dominates on four of five languages (+0.017 Spanish, +0.033 Catalan, +0.018 French, +0.037 Portuguese) at the cost of a small English regression. The trade-off is clinically acceptable: Portuguese reaches R@5 = 0.829 vs. BioBERT-ST's 0.714. Contributions: an open recipe for building domain-specific medical retrievers from LLM-generated data; quantification of the learning gain (MRR 0.755 to 0.876, +15.9% with ~19,500 synthetic pairs); and a characterisation of where gains concentrate by language and rank.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Fine-tuning on Gemini synthetic data gives measurable lifts over BioBERT-ST for clinical code search in five languages, but the gains are measured only against held-out synthetic queries.

The main things to know are that a Spanish biomedical bi-encoder fine-tuned on roughly 19,500 Gemini-generated pairs reaches MRR 0.876 and R@5 0.804, matching or beating BioBERT-ST without English pretraining, and that a cross-encoder reranker then pushes aggregate R@5 to 0.822 with clear per-language gains in Spanish, Catalan, French, and Portuguese.

The paper does a clean job of quantifying the improvement (+15.9% MRR from the base) and breaking it down by language and rank, which makes the practical trade-offs visible. The open recipe for generating the training pairs is also useful for anyone who needs to adapt retrieval models to new medical domains or languages.

The soft spot is the complete dependence on synthetic data for both training and evaluation. Nothing in the abstract indicates a hold-out set of real clinical queries or discharge summaries, so it is unclear whether the reported lifts survive the abbreviations, shorthand, and local phrasing that appear in actual use. The small English regression is acknowledged but secondary to this larger question.

This is aimed at applied researchers building non-English medical search systems. It is worth sending for peer review because the task is concrete, the baselines are external, and the numbers are specific enough to be checked, even though the authors will need to add real-data validation before the central claim is solid.

Referee Report

3 major / 2 minor

Summary. The paper claims that fine-tuning a Spanish biomedical bi-encoder (PlanTL-GOB-ES/bsc-bio-ehr-es) and a subsequent cross-encoder reranker on ~19,500 Gemini-generated synthetic (query, ICD-10 code) pairs across six languages produces retrieval performance that matches or exceeds BioBERT-ST on MRR (0.876 vs. 0.866), R@3, and R@5 without English biomedical pretraining; the reranker further improves aggregate R@5 to 0.822 and yields gains on four of five non-English languages.

Significance. If the synthetic data distribution matches real clinical queries, the work supplies a practical, open recipe for domain-specific multilingual clinical retrievers and quantifies a substantial learning gain (+15.9% MRR) from modest synthetic data volume, with language-specific trade-offs that are clinically relevant for Portuguese.

major comments (3)

[Evaluation] Evaluation section: all reported metrics (including the headline bi-encoder MRR 0.876 / R@5 0.804 and reranker R@5 0.822) are computed exclusively on held-out Gemini-generated pairs; no side-by-side results on authentic clinical queries, discharge summaries, or real coding records from the five target languages are presented. This directly affects the central claim of applicability to non-English clinical coding.
[Data generation] Data generation and validation subsection: the manuscript provides no quantitative comparison (e.g., n-gram overlap, abbreviation frequency, or stylistic metrics) between the Gemini synthetic queries and any sample of real clinical phrasing in Spanish, Catalan, etc., leaving the weakest assumption untested.
[Results] Results tables: no statistical significance tests (paired t-test, bootstrap CI, or McNemar) are reported for the observed differences versus BioBERT-ST, so it is unclear whether the aggregate lifts of +0.010 MRR (bi-encoder) or +0.032 R@5 (with the reranker) are reliable.
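The distributional check requested in the [Data generation] comment could start from something as simple as the sketch below, which measures word-bigram Jaccard overlap between a synthetic and a real query sample. The two tiny Spanish samples are invented for illustration, and whitespace tokenisation is an assumption; a real comparison would need proper clinical tokenisation and far larger samples.

```python
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = 2) -> Set[NGram]:
    """Word n-grams of one query, lowercased and split on whitespace."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def corpus_ngrams(corpus: Iterable[str], n: int = 2) -> Set[NGram]:
    """Union of the n-gram sets of every query in the corpus."""
    collected: Set[NGram] = set()
    for text in corpus:
        collected |= ngrams(text, n)
    return collected


def jaccard_overlap(corpus_a: Iterable[str], corpus_b: Iterable[str], n: int = 2) -> float:
    """Shared n-gram types divided by all n-gram types seen in either corpus."""
    a, b = corpus_ngrams(corpus_a, n), corpus_ngrams(corpus_b, n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical samples: LLM-style queries versus shorthand-heavy real notes.
synthetic = ["paciente con diabetes mellitus tipo 2", "dolor torácico agudo"]
real = ["dm tipo 2 mal controlada", "dolor torácico agudo de inicio súbito"]

overlap = jaccard_overlap(synthetic, real)
print(f"bigram Jaccard overlap: {overlap:.3f}")
```

A low overlap would not by itself show that the synthetic data is unrepresentative, but it would flag where abbreviations and local phrasing diverge from the generated text.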

minor comments (2)

[Abstract] The abstract states the bi-encoder 'matches BioBERT-ST on MRR' yet reports 0.876 vs. 0.866; clarify whether this difference is considered within noise or statistically meaningful.
[Experimental setup] Clarify the exact train / validation / test split ratios and whether any language-specific hyper-parameter tuning was performed.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the careful reading and constructive comments. We address each major point below and indicate the revisions planned.

Referee: [Evaluation] Evaluation section: all reported metrics (including the headline bi-encoder MRR 0.876 / R@5 0.804 and reranker R@5 0.822) are computed exclusively on held-out Gemini-generated pairs; no side-by-side results on authentic clinical queries, discharge summaries, or real coding records from the five target languages are presented. This directly affects the central claim of applicability to non-English clinical coding.

Authors: We agree that the evaluation is performed solely on held-out synthetic pairs and that this constrains direct claims of applicability to real clinical coding workflows. Large-scale, publicly available labeled query sets for ICD-10 retrieval in Catalan, Portuguese and the other target languages do not exist, which is why the study used synthetic data. In revision we will add an explicit Limitations section that states this constraint, qualifies the applicability claims, and identifies collection of real clinical queries as necessary future work. Revision: yes

Referee: [Data generation] Data generation and validation subsection: the manuscript provides no quantitative comparison (e.g., n-gram overlap, abbreviation frequency, or stylistic metrics) between the Gemini synthetic queries and any sample of real clinical phrasing in Spanish, Catalan, etc., leaving the weakest assumption untested.

Authors: The observation is correct; no quantitative distributional comparison between the synthetic queries and real clinical text was included. We lacked access to representative real-world query samples across all six languages. We will add this gap to the Limitations section and frame it as an open validation task for subsequent studies. Revision: yes

Referee: [Results] Results tables: no statistical significance tests (paired t-test, bootstrap CI, or McNemar) are reported for the observed differences versus BioBERT-ST, so it is unclear whether the aggregate lifts of +0.010 MRR (bi-encoder) or +0.032 R@5 (with the reranker) are reliable.

Authors: We accept that the absence of significance testing leaves the magnitude of the reported gains open to question. Because the per-query scores are available, we will compute bootstrap confidence intervals on the key metrics and differences and report them in the revised tables. Revision: yes
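A paired percentile bootstrap over per-query scores is one common way to produce the confidence intervals mentioned in the last response. The sketch below is a minimal version under stated assumptions: the per-query reciprocal ranks are invented, and the function name, resample count and seed are illustrative choices rather than anything specified by the authors.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_ci(scores_a: Sequence[float], scores_b: Sequence[float],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Return (observed mean difference, lower bound, upper bound).

    Queries are resampled with replacement; both systems are always
    compared on the same resampled queries, which keeps the test paired.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    resampled_means: List[float] = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()
    # Percentile interval: cut alpha/2 of the resampled means from each tail.
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Hypothetical per-query reciprocal ranks for two systems on the same queries.
tuned_rr = [1.0, 0.5, 1.0, 1.0, 1 / 3, 1.0, 0.5, 1.0]
baseline_rr = [1.0, 0.5, 0.5, 1.0, 1 / 3, 0.5, 0.5, 1.0]

observed, low, high = paired_bootstrap_ci(tuned_rr, baseline_rr)
print(f"mean difference {observed:+.3f}, 95% CI [{low:+.3f}, {high:+.3f}]")
```

If the resulting interval excludes zero, the difference is unlikely to be a sampling artefact of the test set; with a lift as small as +0.010 MRR, the test-set size will largely decide that.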

standing simulated objections not resolved

Side-by-side evaluation on authentic clinical queries or discharge summaries from the five non-English languages remains outstanding, because no such labeled datasets were available to the authors.

Circularity Check

0 steps flagged

No significant circularity; empirical comparison on held-out synthetic data against external baselines

full rationale

The paper reports an empirical study that fine-tunes a bi-encoder and cross-encoder on Gemini-generated synthetic (query, code) pairs and measures MRR and recall on held-out synthetic test data, with direct numerical comparisons to BioBERT-ST and the untuned encoder. No derivation chain, equations, or uniqueness claims are present; performance figures are experimental outcomes rather than quantities forced by construction from the training inputs. No self-citations are invoked as load-bearing premises, and no ansatz or renaming patterns appear. The central claims rest on observable metric differences, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of LLM-generated synthetic data for real clinical queries across languages; no free parameters or invented entities are explicitly introduced beyond the choice of base encoder and data volume.

axioms (1)

Domain assumption: synthetic data from Gemini is of sufficient quality and coverage to train effective retrievers for clinical coding tasks in the studied languages.
Invoked to justify fine-tuning from the Spanish encoder and to explain the observed gains over baselines.

pith-pipeline@v0.9.1-grok · 5857 in / 1502 out tokens · 38566 ms · 2026-06-29T07:24:54.774420+00:00 · methodology

Reference graph

Works this paper leans on

105 extracted references · 18 canonical work pages · 10 internal anchors

[1]

Sentence-BERT: Sentence embeddings using siamese BERT-networks

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019.

2019
[2]

Dense passage retrieval for open-domain question answering

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.

2020
[3]

Passage re-ranking with BERT

Rodrigo Nogueira and Kyunghyun Cho. Passage re-ranking with BERT, 2019

2019
[4]

Improving efficient neural ranking models with cross-architecture knowledge distillation

Sebastian Hofstätter, Sophia Althammer, Michael Schröder, Mete Sertkan, and Allan Hanbury. Improving efficient neural ranking models with cross-architecture knowledge distillation. arXiv preprint arXiv:2010.02666, 2020.

work page arXiv 2020
[5]

Retrieval-augmented generation for knowledge-intensive NLP tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

2020
[6]

Benchmarking retrieval-augmented generation for medicine

Guangzhi Xiong, Qiao Jin, Zhiyong Lu, and Aidong Zhang. Benchmarking retrieval-augmented generation for medicine. In Findings of the Association for Computational Linguistics (ACL Findings), 2024.

2024
[7]

Retrieval-augmented generation (RAG) in healthcare: A comprehensive review

Others. Retrieval-augmented generation (RAG) in healthcare: A comprehensive review. AI (MDPI), 2025.

2025
[8]

Explainable prediction of medical codes from clinical text

James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. Explainable prediction of medical codes from clinical text. NAACL-HLT, 2018.

2018
[9]

Shaoxiong Ji, Sina Pan, Erik Cambria, Pekka Marttinen, and Philip S. Yu. Does the magic of BERT apply to medical code assignment? A quantitative study. Computers in Biology and Medicine, 139:104998, 2021.

2021
[10]

Code synonyms do matter: Multiple synonyms matching network for automatic ICD coding

Zheng Yuan, Chuanqi Tan, and Songfang Huang. Code synonyms do matter: Multiple synonyms matching network for automatic ICD coding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022.

2022
[11]

BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, 2020.

2020
[12]

Publicly available clinical BERT embeddings

Emily Alsentzer, John R. Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew B. A. McDermott. Publicly available clinical BERT embeddings. Proceedings of the 2nd Clinical Natural Language Processing Workshop (NAACL), 2019.

2019
[13]

PubMedQA: A dataset for biomedical research question answering

Qiao Jin, Bhuwan Dhingra, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. EMNLP, 2019.

2019
[14]

The state and fate of linguistic diversity and inclusion in the NLP world

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. The state and fate of linguistic diversity and inclusion in the NLP world. Proceedings of the 58th Annual Meeting of the ACL, 2020.

2020
[15]

Unsupervised cross-lingual representation learning at scale

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020.

2020
[16]

How multilingual is multilingual BERT?

Telmo Pires, Eva Schlinger, and Dan Garrette. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the ACL, 2019

2019
[17]

Making monolingual sentence embeddings multilingual using knowledge distillation

Nils Reimers and Iryna Gurevych. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.

2020
[18]

Clinical natural language processing in languages other than English: Opportunities and challenges

Aurélie Névéol, Hercules Dalianis, Sumithra Velupillai, Guergana Savova, and Pierre Zweigenbaum. Clinical natural language processing in languages other than English: Opportunities and challenges. Journal of Biomedical Semantics, 9(12), 2018.

2018
[19]

Pretrained biomedical language models for clinical NLP in Spanish

Casimiro Pio Carrino, Joan Llop, Marc Pàmies, Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Joaquín Silveira-Ocampo, Alfonso Valencia, Aitor Gonzalez-Agirre, and Marta Villegas. Pretrained biomedical language models for clinical NLP in Spanish. Proceedings of the 21st Workshop on Biomedical Language Processing, BioNLP at ACL, 2022.

2022
[20]

From zero to hero: On the limitations of zero-shot cross-lingual transfer with multilingual transformers

Anne Lauscher, Vinit Ravishankar, Ivan Vulić, and Goran Glavaš. From zero to hero: On the limitations of zero-shot cross-lingual transfer with multilingual transformers. In EMNLP, 2020.

2020
[21]

Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT

Shijie Wu and Mark Dredze. Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. In EMNLP-IJCNLP, 2019.

2019
[22]

Self-Instruct: Aligning language models with self-generated instructions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-Instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the ACL, 2023.

2023
[23]

InPars: Unsupervised dataset generation for information retrieval

Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, and Rodrigo Nogueira. InPars: Unsupervised dataset generation for information retrieval. In SIGIR, 2022.

2022
[24]

Promptagator: Few-shot dense retrieval from 8 examples

Zhuyun Dai, Vincent Y. Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu, Keith Hall, and Ming-Wei Chang. Promptagator: Few-shot dense retrieval from 8 examples, 2022

2022
[25]

A study of generative large language model for medical research and ...

Cheng Peng, Xi Yang, Aokun Chen, Kaleb E. Smith, Nima PourNejatian, Anthony B. Costa, Cheryl Martin, Mona G. Flores, Ying Zhang, Tanja Magoc, Gloria Lipori, Duane A. Mitchell, Naykky S. Ospina, Mustafa M. Ahmed, William R. Hogan, Elizabeth A. Shenkman, Yi Guo, Jiang Bian, and Yonghui Wu. A study of generative large language model for medical research and ...

2023
[26]

Knowledge-infused prompting: Assessing and advancing clinical text data generation with large language models

Ran Xu, Hejie Cui, Yue Yu, Xuan Kan, Wenqi Shi, Yuchen Zhuang, Wei Jin, Joyce Ho, and Carl Yang. Knowledge-infused prompting: Assessing and advancing clinical text data generation with large language models. In Findings of the Association for Computational Linguistics: ACL 2024, 2024.

2024
[27]

Two directions for clinical data generation with large language models: Data-to-label and label-to-data

Rumeng Li, Xun Wang, and Hong Yu. Two directions for clinical data generation with large language models: Data-to-label and label-to-data. In Findings of the Association for Computational Linguistics: EMNLP 2023, 2023.

2023
[28]

MedSyn: LLM-based synthetic medical text generation framework

Gleb Kumichev, Pavel Blinov, Yulia Kuzkina, Vasily Goncharov, Galina Zubkova, Nikolai Zenovkin, Aleksei Goncharov, and Andrey Savchenko. MedSyn: LLM-based synthetic medical text generation framework. In Machine Learning and Knowledge Discovery in Databases, 2024.

2024
[29]

On LLMs-driven synthetic data generation, curation, and evaluation: A survey

Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, and Haobo Wang. On LLMs-driven synthetic data generation, curation, and evaluation: A survey. arXiv preprint arXiv:2406.15126, 2024.

work page arXiv 2024
[30]

The curse of recursion: Training on generated data makes models forget

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. The curse of recursion: Training on generated data makes models forget, 2023

2023
[31]

Efficient natural language response suggestion for Smart Reply

Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-Hsuan Sung, László Lukács, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. Efficient natural language response suggestion for Smart Reply, 2017

2017
[32]

Learning to rank: From pairwise approach to listwise approach

Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129–136, 2007.

2007
[33]

Gemini 2.5 Pro

Google DeepMind. Gemini 2.5 Pro, 2025.

2025
[34]

Overview of automatic clinical coding: Annotations, guidelines, and solutions for non-English clinical cases at CodiEsp track of CLEF eHealth 2020

Antonio Miranda-Escalada, Aitor Gonzalez-Agirre, Jordi Armengol-Estapé, and Martin Krallinger. Overview of automatic clinical coding: Annotations, guidelines, and solutions for non-English clinical cases at CodiEsp track of CLEF eHealth 2020. In Working Notes of CLEF 2020 – Conference and Labs of the Evaluation Forum, 2020.

2020
[35]

Unsupervised keyword combination query generation from online health related content for evidence-based fact checking

Pritam Deka and Anna Jurek-Loughrey. Unsupervised keyword combination query generation from online health related content for evidence-based fact checking. In The 23rd International Conference on Information Integration and Web Intelligence, pages 267–277, 2021.

2021
[36]

MPNet: Masked and permuted pre-training for language understanding

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MPNet: Masked and permuted pre-training for language understanding. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

2020
[37]

Overview of DisTEMIST at BioASQ: Automatic detection and normalization of diseases from Spanish clinical cases

Antonio Miranda-Escalada, Luis Gascó, Salvador Lima-López, Eulàlia Farré-Maduell, Daniel Estrada, Anastasios Nentidis, Anastasia Krithara, Georgios Katsimpras, Georgios Paliouras, and Martin Krallinger. Overview of DisTEMIST at BioASQ: Automatic detection and normalization of diseases from Spanish clinical cases. In Working Notes of CLEF 2022, 2022.

2022
[38]

Centers for Medicare & Medicaid Services and National Center for Health Statistics. ICD-10-CM Official Guidelines for Coding and Reporting, FY 2024, 2024.

2024
[39]

Ministerio de Sanidad, Servicios Sociales e Igualdad, Gobierno de España. Manual de Codificación CIE-10-ES Diagnósticos, 6ª Edición, 2024.

2024
[40]

How good is your tokenizer? on the monolingual performance of multilingual language models

Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, and Iryna Gurevych. How good is your tokenizer? on the monolingual performance of multilingual language models. In Proceedings of ACL-IJCNLP, 2021

2021
[41]

Do all languages cost the same? Tokenization in the era of commercial language models

Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David R. Mortensen, Noah A. Smith, and Yulia Tsvetkov. Do all languages cost the same? Tokenization in the era of commercial language models. In Proceedings of EMNLP, 2023.

2023
[42]

Aleksandar Petrov, Emanuele La Malfa, Philip H. S. Torr, and Adel Bibi. Language model tokenizers introduce unfairness between languages. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

2023
[43]

Multilingual clinical NER: Translation or cross-lingual transfer?

Felipe Soares, Marta Villegas, Aitor Gonzalez-Agirre, Martin Krallinger, and Jordi Armengol-Estapé. Multilingual clinical NER: Translation or cross-lingual transfer? Proceedings of the LREC BioNLP Workshop, 2019.

2019
[44]

IAM at CLEF eHealth 2020: Concept annotation in Spanish electronic health records

Aitor Garcia-Pablos, Naiara Perez, and Montse Cuadros. IAM at CLEF eHealth 2020: Concept annotation in Spanish electronic health records. In Working Notes of CLEF 2020, 2020.

2020
[45]

Integrating agentic artificial intelligence to automate ICD-10 medical coding

Preprints.org Authors. Integrating agentic artificial intelligence to automate ICD-10 medical coding. Informatics (MDPI), 2026.

2026
[46]

Fernando Gallego, Guillermo López-García, Luis Gascó-Sánchez, Martin Krallinger, and Francisco J. Veredas. ClinLinker: Medical entity linking of clinical concept mentions in Spanish. In International Conference on Computational Science (ICCS), 2024.

2024
[47]

PLM-ICD: Automatic ICD coding with pretrained language models

Chao-Wei Huang, Shang-Chi Tsai, and Yun-Nung Chen. PLM-ICD: Automatic ICD coding with pretrained language models. In Proceedings of the 4th Clinical Natural Language Processing Workshop (ACL-ClinicalNLP), 2022.

2022
[48]

GoM-ICD: Automatic ICD coding with gap schemes and mixture of experts

Yuxiang Pan et al. GoM-ICD: Automatic ICD coding with gap schemes and mixture of experts. Big Data Mining and Analytics, 2025.

2025
[49]

Automatic ICD coding using LLMs: a systematic review

medRxiv Authors. Automatic ICD coding using LLMs: a systematic review. medRxiv preprint, 2025.

2025
[50]

HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats

OpenAI. HealthBench Professional: Evaluating large language models on real clinician chats. arXiv preprint arXiv:2604.27470, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[51]

HealthBench: Evaluating Large Language Models Towards Improved Human Health

Ruchir Arora et al. HealthBench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775, 2025.

work page internal anchor Pith review Pith/arXiv arXiv 2025
[52]

Introducing HealthBench

OpenAI. Introducing HealthBench, 2025

2025
[53]

OpenAI's HealthBench in action: Evaluating an LLM-based medical assistant on realistic clinical queries

DR. INFO Team. OpenAI's HealthBench in action: Evaluating an LLM-based medical assistant on realistic clinical queries. arXiv preprint arXiv:2509.02594, 2025.

work page arXiv 2025
[54]

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

Samuel Schmidgall et al. AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments. npj Digital Medicine, 2026.

2026
[55]

MedAgentBench: A realistic virtual EHR environment to benchmark medical LLM agents

Stanford ML Group et al. MedAgentBench: A realistic virtual EHR environment to benchmark medical LLM agents. NEJM AI, 2025.

2025
[56]

PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

Ruoqi Liu, Imran Q. Mohiuddin, Austin J. Schoeffler, et al. PhysicianBench: Evaluating LLM agents in real-world EHR environments. arXiv preprint arXiv:2605.02240, 2026.

work page internal anchor Pith review Pith/arXiv arXiv 2026
[57]

MedMT-Bench: Can LLMs memorize and understand long multi-turn conversations in medical scenarios?

Lin Yang, Yuancheng Yang, Xu Wang, Changkun Liu, and Haihua Yang. MedMT-Bench: Can LLMs memorize and understand long multi-turn conversations in medical scenarios? arXiv preprint arXiv:2603.23519, 2026.

work page arXiv 2026
[58]

MedAgents: Large language models as collaborators for zero-shot medical reasoning

Xiangru Tang et al. MedAgents: Large language models as collaborators for zero-shot medical reasoning. 2024

2024
[59]

Glicksberg, Girish N

Alon Gorenshtein, Mahmud Omar, Benjamin S. Glicksberg, Girish N. Nadkarni, and Eyal Klang. AI agents in clinical medicine: A systematic review.medRxiv preprint, 2025. 23

2025
[60]

Medical Reasoning with Large Language Models: A Survey and MR-Bench

Others. Medical reasoning with large language models: A survey and MR-Bench.arXiv preprint arXiv:2604.08559, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[61]

Walker, et al

Oliver Normand, Esther Borsi, Mitch Fruin, Lauren E. Walker, et al. A real-world evaluation of LLM medication safety reviews in NHS primary care.arXiv preprint arXiv:2512.21127, 2025

work page arXiv 2025
[62]

Large language model as clinical decision support system augments medication safety in 16 clinical specialties.npj Digital Medicine, 2025

Others. Large language model as clinical decision support system augments medication safety in 16 clinical specialties.npj Digital Medicine, 2025

2025
[63]

Grounding large language models in clinical evidence: A retrieval-augmented generation system for querying UK NICE clinical guidelines

Matthew Lewis, Samuel Thio, Amy Roberts, et al. Grounding large language models in clinical evidence: A retrieval-augmented generation system for querying UK NICE clinical guidelines. arXiv preprint arXiv:2510.02967, 2025

work page arXiv 2025
[64]

Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning

Derek Wong et al. Prompt-level distillation.arXiv preprint arXiv:2602.21103, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[65]

Gemini 3.1 pro, 2026

Google DeepMind. Gemini 3.1 pro, 2026

2026
[66]

Gpt-5.4, 2026

OpenAI. Gpt-5.4, 2026. Accessed 2026-03-05

2026
[67]

MedGemma: Open medical foundation models, 2025

Google Research. MedGemma: Open medical foundation models, 2025

2025
[68]

TietAI Hydra Platform, 2026

Roberto Cruz. TietAI Hydra Platform, 2026

2026
[69]

TietAI Evals Public: Empirical analysis results for MDIA on HealthBench Professional, 2026

Cruz, Roberto, Rey-Blanco, David. TietAI Evals Public: Empirical analysis results for MDIA on HealthBench Professional, 2026. Public repository

2026
[70]

Chateval: Towards better llm-based evaluators through multi-agent debate

Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate. InInternational Conference on Learning Representations, 2024

2024
[71]

Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023

2023
[72]

Multi-agent debate for llm judges with adaptive stability detection

Tianyu Hu, Zhen Tan, Song Wang, Huaizhi Qu, and Tianlong Chen. Multi-agent debate for llm judges with adaptive stability detection. InAdvances in Neural Information Processing Systems, 2025

2025
[73]

Laura Dietz, Oleg Zendel, Peter Bailey, Charles L. A. Clarke, Ellese Cotterill, Jeff Dalton, Faegheh Hasibi, Mark Sanderson, and Nick Craswell. Principles and guidelines for the use of llm judges. InProceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval, ICTIR ’25, pages 1–12. ACM, 2025

2025
[74]

Medmcqa: A large- scale multi-subject multi-choice dataset for medical domain question answering.arXiv preprint arXiv:2203.14371, 2022

Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large- scale multi-subject multi-choice dataset for medical domain question answering.arXiv preprint arXiv:2203.14371, 2022

work page arXiv 2022
[75]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

Di Jin, Eileen Pan, Nahid Oufattole, Wei-Hung Weng, Hui Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021. 24

2021
[76]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Sivasankar Kannan, Dawn Song, and Jacob Stein- hardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[77]

Large language models encode clinical knowledge.Nature, 620:172–180, 2023

Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, et al. Large language models encode clinical knowledge.Nature, 620:172–180, 2023

2023
[78]

Hashimoto

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators, 2024

2024
[79]

Yu Ying Chiu, Michael S. Lee, Rachel Calcott, Brandon Handoko, Paul de Font-Reaulx, Paula Rodriguez, Chen Bo Calvin Zhang, Ziwen Han, Udari Madhushani Sehwag, Yash Maurya, Christina Knight, Harry Lloyd, Florence Bacus, Mantas Mazeika, Bing Liu, Yejin Choi, Mitchell Gordon, and Sydney Levine. Morebench: Evaluating procedural and pluralistic moral reasoning...

2025
[80]

Explaining length bias in llm-based preference evaluations, 2024

Zhengyu Hu, Linxin Song, Jieyu Zhang, Zheyuan Xiao, Tianfu Wang, Zhenyu Chen, Jianxun Lian, Nicholas Jing Yuan, Kaize Ding, and Hui Xiong. Explaining length bias in llm-based preference evaluations, 2024

2024

Showing first 80 references.

[1] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019

[2] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

[3] Rodrigo Nogueira and Kyunghyun Cho. Passage re-ranking with BERT, 2019

[4] Sebastian Hofstätter, Sophia Althammer, Michael Schröder, Mete Sertkan, and Allan Hanbury. Improving efficient neural ranking models with cross-architecture knowledge distillation. arXiv preprint arXiv:2010.02666, 2020

[5] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2020

[6] Guangzhi Xiong, Qiao Jin, Zhiyong Lu, and Aidong Zhang. Benchmarking retrieval-augmented generation for medicine. In Findings of the Association for Computational Linguistics (ACL Findings), 2024

[7] Others. Retrieval-augmented generation (RAG) in healthcare: A comprehensive review. AI (MDPI), 2025

[8] James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. Explainable prediction of medical codes from clinical text. NAACL-HLT, 2018

[9] Shaoxiong Ji, Sina Pan, Erik Cambria, Pekka Marttinen, and Philip S. Yu. Does the magic of BERT apply to medical code assignment? a quantitative study. Computers in Biology and Medicine, 139:104998, 2021

[10] Zheng Yuan, Chuanqi Tan, and Songfang Huang. Code synonyms do matter: Multiple synonyms matching network for automatic ICD coding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022

[11] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, 2020

[12] Emily Alsentzer, John R. Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew B. A. McDermott. Publicly available clinical BERT embeddings. Proceedings of the 2nd Clinical Natural Language Processing Workshop (NAACL), 2019

[13] Qiao Jin, Bhuwan Dhingra, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. EMNLP, 2019

[14] Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. The state and fate of linguistic diversity and inclusion in the NLP world. Proceedings of the 58th Annual Meeting of the ACL, 2020

[15] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020

[16] Telmo Pires, Eva Schlinger, and Dan Garrette. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the ACL, 2019

[17] Nils Reimers and Iryna Gurevych. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

[18] Aurélie Névéol, Hercules Dalianis, Sumithra Velupillai, Guergana Savova, and Pierre Zweigenbaum. Clinical natural language processing in languages other than English: Opportunities and challenges. Journal of Biomedical Semantics, 9(12), 2018

[19] Casimiro Pio Carrino, Joan Llop, Marc Pàmies, Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Joaquín Silveira-Ocampo, Alfonso Valencia, Aitor Gonzalez-Agirre, and Marta Villegas. Pretrained biomedical language models for clinical NLP in Spanish. Proceedings of the 21st Workshop on Biomedical Language Processing, BioNLP at ACL, 2022

[20] Anne Lauscher, Vinit Ravishankar, Ivan Vulić, and Goran Glavaš. From zero to hero: On the limitations of zero-shot cross-lingual transfer with multilingual transformers. In EMNLP, 2020

[21] Shijie Wu and Mark Dredze. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In EMNLP-IJCNLP, 2019

[22] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-Instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the ACL, 2023

[23] Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, and Rodrigo Nogueira. InPars: Unsupervised dataset generation for information retrieval. In SIGIR, 2022

[24] Zhuyun Dai, Vincent Y. Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu, Keith Hall, and Ming-Wei Chang. Promptagator: Few-shot dense retrieval from 8 examples, 2022

[25] Cheng Peng, Xi Yang, Aokun Chen, Kaleb E. Smith, Nima PourNejatian, Anthony B. Costa, Cheryl Martin, Mona G. Flores, Ying Zhang, Tanja Magoc, Gloria Lipori, Duane A. Mitchell, Naykky S. Ospina, Mustafa M. Ahmed, William R. Hogan, Elizabeth A. Shenkman, Yi Guo, Jiang Bian, and Yonghui Wu. A study of generative large language model for medical research and ..., 2023

[26] Ran Xu, Hejie Cui, Yue Yu, Xuan Kan, Wenqi Shi, Yuchen Zhuang, Wei Jin, Joyce Ho, and Carl Yang. Knowledge-infused prompting: Assessing and advancing clinical text data generation with large language models. In Findings of the Association for Computational Linguistics: ACL 2024, 2024

[27] Rumeng Li, Xun Wang, and Hong Yu. Two directions for clinical data generation with large language models: Data-to-label and label-to-data. In Findings of the Association for Computational Linguistics: EMNLP 2023, 2023

[28] Gleb Kumichev, Pavel Blinov, Yulia Kuzkina, Vasily Goncharov, Galina Zubkova, Nikolai Zenovkin, Aleksei Goncharov, and Andrey Savchenko. MedSyn: LLM-based synthetic medical text generation framework. In Machine Learning and Knowledge Discovery in Databases, 2024

[29] Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, and Haobo Wang. On LLMs-driven synthetic data generation, curation, and evaluation: A survey. arXiv preprint arXiv:2406.15126, 2024

[30] Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. The curse of recursion: Training on generated data makes models forget, 2023

[31] Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-Hsuan Sung, László Lukács, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. Efficient natural language response suggestion for Smart Reply, 2017

[32] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129–136, 2007

[33] Google DeepMind. Gemini 2.5 pro, 2025

[34] Antonio Miranda-Escalada, Aitor Gonzalez-Agirre, Jordi Armengol-Estapé, and Martin Krallinger. Overview of automatic clinical coding: Annotations, guidelines, and solutions for non-English clinical cases at CodiEsp track of CLEF eHealth 2020. In Working Notes of CLEF 2020 – Conference and Labs of the Evaluation Forum, 2020

[35] Pritam Deka and Anna Jurek-Loughrey. Unsupervised keyword combination query generation from online health related content for evidence-based fact checking. In The 23rd International Conference on Information Integration and Web Intelligence, pages 267–277, 2021

[36] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MPNet: Masked and permuted pre-training for language understanding. In Advances in Neural Information Processing Systems (NeurIPS), 2020

[37] Antonio Miranda-Escalada, Luis Gascó, Salvador Lima-López, Eulàlia Farré-Maduell, Daniel Estrada, Anastasios Nentidis, Anastasia Krithara, Georgios Katsimpras, Georgios Paliouras, and Martin Krallinger. Overview of DisTEMIST at BioASQ: Automatic detection and normalization of diseases from Spanish clinical cases. In Working Notes of CLEF 2022, 2022

[38] Centers for Medicare & Medicaid Services and National Center for Health Statistics. ICD-10-CM Official Guidelines for Coding and Reporting, FY 2024, 2024

[39] Ministerio de Sanidad, Servicios Sociales e Igualdad, Gobierno de España. Manual de Codificación CIE-10-ES Diagnósticos, 6ª Edición, 2024

[40] Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, and Iryna Gurevych. How good is your tokenizer? on the monolingual performance of multilingual language models. In Proceedings of ACL-IJCNLP, 2021

[41] Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David R. Mortensen, Noah A. Smith, and Yulia Tsvetkov. Do all languages cost the same? tokenization in the era of commercial language models. In Proceedings of EMNLP, 2023

[42] Aleksandar Petrov, Emanuele La Malfa, Philip H. S. Torr, and Adel Bibi. Language model tokenizers introduce unfairness between languages. In Advances in Neural Information Processing Systems (NeurIPS), 2023

[43] Felipe Soares, Marta Villegas, Aitor Gonzalez-Agirre, Martin Krallinger, and Jordi Armengol-Estapé. Multilingual clinical NER: Translation or cross-lingual transfer? Proceedings of the LREC BioNLP Workshop, 2019

[44] Aitor Garcia-Pablos, Naiara Perez, and Montse Cuadros. IAM at CLEF eHealth 2020: Concept annotation in Spanish electronic health records. In Working Notes of CLEF 2020, 2020

[45] Preprints.org Authors. Integrating agentic artificial intelligence to automate ICD-10 medical coding. Informatics (MDPI), 2026

[46] Fernando Gallego, Guillermo López-García, Luis Gascó-Sánchez, Martin Krallinger, and Francisco J. Veredas. ClinLinker: Medical entity linking of clinical concept mentions in Spanish. In International Conference on Computational Science (ICCS), 2024

[47] Chao-Wei Huang, Shang-Chi Tsai, and Yun-Nung Chen. PLM-ICD: Automatic ICD coding with pretrained language models. In Proceedings of the 4th Clinical Natural Language Processing Workshop (ACL-ClinicalNLP), 2022

[48] Yuxiang Pan et al. GoM-ICD: Automatic ICD coding with gap schemes and mixture of experts. Big Data Mining and Analytics, 2025

[49] medRxiv Authors. Automatic ICD coding using LLMs: a systematic review. medRxiv preprint, 2025

[50] OpenAI. HealthBench Professional: Evaluating large language models on real clinician chats. arXiv preprint arXiv:2604.27470, 2026

[51] Ruchir Arora et al. HealthBench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775, 2025

[52] OpenAI. Introducing HealthBench, 2025

[53] DR. INFO Team. OpenAI’s HealthBench in action: Evaluating an LLM-based medical assistant on realistic clinical queries. arXiv preprint arXiv:2509.02594, 2025

[54] Samuel Schmidgall et al. AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments. npj Digital Medicine, 2026

[55] Stanford ML Group et al. MedAgentBench: A realistic virtual EHR environment to benchmark medical LLM agents. NEJM AI, 2025

[56] Ruoqi Liu, Imran Q. Mohiuddin, Austin J. Schoeffler, et al. PhysicianBench: Evaluating LLM agents in real-world EHR environments. arXiv preprint arXiv:2605.02240, 2026

[57] Lin Yang, Yuancheng Yang, Xu Wang, Changkun Liu, and Haihua Yang. MedMT-Bench: Can LLMs memorize and understand long multi-turn conversations in medical scenarios? arXiv preprint arXiv:2603.23519, 2026

[58] Xiangru Tang et al. MedAgents: Large language models as collaborators for zero-shot medical reasoning, 2024

[59] Alon Gorenshtein, Mahmud Omar, Benjamin S. Glicksberg, Girish N. Nadkarni, and Eyal Klang. AI agents in clinical medicine: A systematic review. medRxiv preprint, 2025

[60] Others. Medical reasoning with large language models: A survey and MR-Bench. arXiv preprint arXiv:2604.08559, 2026

[61] Oliver Normand, Esther Borsi, Mitch Fruin, Lauren E. Walker, et al. A real-world evaluation of LLM medication safety reviews in NHS primary care. arXiv preprint arXiv:2512.21127, 2025

[62] Others. Large language model as clinical decision support system augments medication safety in 16 clinical specialties. npj Digital Medicine, 2025

[63] Matthew Lewis, Samuel Thio, Amy Roberts, et al. Grounding large language models in clinical evidence: A retrieval-augmented generation system for querying UK NICE clinical guidelines. arXiv preprint arXiv:2510.02967, 2025

[64] Derek Wong et al. Prompt-level distillation: A non-parametric alternative to model fine-tuning for efficient reasoning. arXiv preprint arXiv:2602.21103, 2026

[65] Google DeepMind. Gemini 3.1 pro, 2026

[66] OpenAI. GPT-5.4, 2026. Accessed 2026-03-05

[67] Google Research. MedGemma: Open medical foundation models, 2025

[68] Roberto Cruz. TietAI Hydra Platform, 2026

[69] Roberto Cruz and David Rey-Blanco. TietAI Evals Public: Empirical analysis results for MDIA on HealthBench Professional, 2026. Public repository

[70] Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. ChatEval: Towards better LLM-based evaluators through multi-agent debate. In International Conference on Learning Representations, 2024

[71] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023

[72] Tianyu Hu, Zhen Tan, Song Wang, Huaizhi Qu, and Tianlong Chen. Multi-agent debate for LLM judges with adaptive stability detection. In Advances in Neural Information Processing Systems, 2025

[73] Laura Dietz, Oleg Zendel, Peter Bailey, Charles L. A. Clarke, Ellese Cotterill, Jeff Dalton, Faegheh Hasibi, Mark Sanderson, and Nick Craswell. Principles and guidelines for the use of LLM judges. In Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval, ICTIR ’25, pages 1–12. ACM, 2025

[74] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. arXiv preprint arXiv:2203.14371, 2022

[75] Di Jin, Eileen Pan, Nahid Oufattole, Wei-Hung Weng, Hui Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021

[76] Dan Hendrycks, Collin Burns, Steven Basart, Sivasankar Kannan, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

[77] Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, et al. Large language models encode clinical knowledge. Nature, 620:172–180, 2023

[78] Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators, 2024

[79] Yu Ying Chiu, Michael S. Lee, Rachel Calcott, Brandon Handoko, Paul de Font-Reaulx, Paula Rodriguez, Chen Bo Calvin Zhang, Ziwen Han, Udari Madhushani Sehwag, Yash Maurya, Christina Knight, Harry Lloyd, Florence Bacus, Mantas Mazeika, Bing Liu, Yejin Choi, Mitchell Gordon, and Sydney Levine. Morebench: Evaluating procedural and pluralistic moral reasoning..., 2025

[80] Zhengyu Hu, Linxin Song, Jieyu Zhang, Zheyuan Xiao, Tianfu Wang, Zhenyu Chen, Jianxun Lian, Nicholas Jing Yuan, Kaize Ding, and Hui Xiong. Explaining length bias in LLM-based preference evaluations, 2024