pith. machine review for the scientific record.

arXiv: 2601.19667 · v2 · submitted 2026-01-27 · cs.CL · cs.AI · cs.IR · cs.LG

Recognition: unknown

SynCABEL: Synthetic Contextualized Augmentation for Biomedical Entity Linking

Authors on Pith: no claims yet
classification: cs.CL · cs.AI · cs.IR · cs.LG
keywords: syncabel · models · synthetic · biomedical · data · entity · linking · augmentation
abstract

We present SynCABEL (Synthetic Contextualized Augmentation for Biomedical Entity Linking), a framework that addresses a central bottleneck in supervised biomedical entity linking (BEL): the scarcity of expert-annotated training data. SynCABEL leverages large language models to generate context-rich synthetic training examples for all candidate concepts in a target knowledge base, providing broad supervision without manual annotation. We demonstrate that SynCABEL, when combined with decoder-only models and guided inference, establishes new state-of-the-art results across three widely used multilingual benchmarks: MedMentions for English, QUAERO for French, and SPACCC for Spanish. Evaluating data efficiency, we show that SynCABEL reaches the performance of full human supervision using up to 60% less annotated data, substantially reducing reliance on labor-intensive and costly expert labeling. Finally, acknowledging that standard evaluation based on exact code matching often underestimates clinically valid predictions due to ontology redundancy, we introduce an LLM-as-a-judge protocol. This analysis reveals that SynCABEL significantly improves the rate of clinically valid predictions. Our synthetic datasets, models, and code are released to support reproducibility and future research: HuggingFace Datasets & Models · GitHub Repository
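The augmentation step described in the abstract (one context-rich synthetic example per knowledge-base concept) can be pictured with a minimal sketch. The prompt template, the `[E]…[/E]` mention tags, and the toy knowledge-base entries below are illustrative assumptions, not the authors' actual implementation; in practice the prompts would be sent to an LLM and the responses paired with their concept codes as training data.

```python
# Minimal sketch of SynCABEL-style synthetic augmentation (assumed details).
# Each KB entry is (concept code, preferred name, synonyms); one prompt is
# built per concept so that every candidate concept gets supervision.

def build_augmentation_prompt(code: str, name: str, synonyms: list[str]) -> str:
    """Ask an LLM for a context-rich synthetic mention of one KB concept."""
    syns = ", ".join(synonyms) if synonyms else "none"
    return (
        "You are generating training data for biomedical entity linking.\n"
        f"Concept code: {code}\n"
        f"Preferred name: {name}\n"
        f"Synonyms: {syns}\n"
        "Write one realistic clinical sentence mentioning this concept, "
        "marking the mention with [E] ... [/E] tags."
    )

# Toy knowledge base (hypothetical UMLS-style codes).
kb = [
    ("C0011849", "Diabetes Mellitus", ["diabetes", "DM"]),
    ("C0020538", "Hypertensive disease", ["hypertension", "high blood pressure"]),
]

# One prompt per candidate concept; the LLM responses would become
# (synthetic mention in context, concept code) training pairs.
prompts = [build_augmentation_prompt(*entry) for entry in kb]
```

The point of iterating over the whole knowledge base, rather than only concepts seen in annotated corpora, is the "broad supervision" the abstract claims: concepts with zero human-labeled mentions still receive training examples.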

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LongBEL: Long-Context and Document-Consistent Biomedical Entity Linking

    cs.CL · 2026-05 · unverdicted · novelty 7.0

    LongBEL improves biomedical entity linking consistency by combining full-document context with memory of previous predictions trained via cross-validation rather than gold labels.