New Benchmarking Shows Limited Generalization Power of TCR Antigenic Epitope Prediction Models

Bo Li; Keke Chen; Ning Jiang; Yiheng Li; Yiming Liao

arxiv: 2606.04994 · v1 · pith:6HM2DSQHnew · submitted 2026-06-03 · 💻 cs.LG · q-bio.QM

New Benchmarking Shows Limited Generalization Power of TCR Antigenic Epitope Prediction Models

Yiming Liao , Yiheng Li , Ning Jiang , Bo Li , Keke Chen This is my paper

Pith reviewed 2026-06-28 07:01 UTC · model grok-4.3

classification 💻 cs.LG q-bio.QM

keywords TCRepitope predictiongeneralizationbenchmark datasetsT cell receptorantigen specificitymachine learning models

0 comments

The pith

Existing TCR epitope prediction models show limited generalization to unseen antigens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current models for predicting which antigens T cell receptors will recognize do not perform well enough for real-world applications because they fail to generalize beyond their training data. A key problem has been the lack of benchmark datasets where the test cases are guaranteed to be new to the models. The authors introduce two new types of datasets designed to be unseen, allowing fair tests of how well models can predict specificity for novel epitopes. If these datasets work as intended, they will reveal the true limits of today's approaches and help build better ones for studying immune responses and engineering therapies.

Core claim

Models for predicting TCR-antigen specificity lack sufficient generalization power, as demonstrated by their performance on two new classes of rigorously defined unseen benchmark datasets that enable unbiased evaluation.

What carries the argument

Two complementary classes of rigorously defined unseen benchmark datasets for assessing TCR antigenic epitope prediction models.

If this is right

Current models will exhibit reduced sensitivity and specificity on the new unseen datasets compared to standard tests.
The new datasets will serve as a standard for evaluating and improving future prediction algorithms.
Accurate generalization testing will support scalable immune engineering applications.
Absence of such benchmarks has previously led to overoptimistic assessments of model performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar issues of overfitting to seen data may affect other biological sequence prediction tasks.
Developing models that can handle truly novel epitopes may require incorporating structural or evolutionary information beyond sequence patterns.
These benchmarks could be extended to other immune receptor types like BCRs.
Performance on these sets might correlate with real-world utility in vaccine design or cancer immunotherapy.

Load-bearing premise

The two new dataset classes are truly unseen by the models being tested and any performance drop reflects genuine lack of generalization rather than how the datasets were built.

What would settle it

A model achieving high accuracy and specificity on both classes of these new unseen benchmark datasets would indicate that the claim of limited generalization does not hold.

Figures

Figures reproduced from arXiv: 2606.04994 by Bo Li, Keke Chen, Ning Jiang, Yiheng Li, Yiming Liao.

**Figure 1.** Figure 1: Systematic benchmarking of TCR–peptide binding models across diverse epitope landscapes. (a) Benchmark pipeline overview. Schematic of the evaluation workflow: three datasets (TetTCR-SeqHD, IMMREP23, Fingerprinting) undergo standardized preprocessing and model-specific data-leakage control before inference. Performance is quantified using macro-averaged partial AUC (pAUC0.1). (b) Performance on viral/sel… view at source ↗

read the original abstract

Accurate computational prediction of T cell receptor (TCR) antigen specificity would transform the study of T cell biology and enable scalable immune engineering, yet existing models lack sufficient sensitivity and specificity for broad applications. A major limitation is the absence of rigorously defined, unseen benchmark datasets that allow unbiased evaluation of model performance and generalizability. Here, we describe two complementary classes of datasets that meet this criterion and argue that they provide both a robust framework for model assessment and a foundation for next-generation TCR-antigen prediction algorithm development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's contribution is two new classes of TCR benchmark datasets meant to be strictly unseen, but the abstract gives zero details on construction or results so the generalization claim can't be evaluated yet.

read the letter

The core offering here is a pair of complementary benchmark dataset classes for TCR-epitope models that the authors say satisfy a stricter unseen criterion than prior sets. That directly targets a real and acknowledged weakness in the area: most existing test splits leak sequences or close homologs from the training corpora of the models being evaluated.

The paper does a clean job of stating the problem and positioning the new sets as a practical step forward for assessing sensitivity and specificity in applications like vaccine design. If the construction protocol turns out to be exhaustive on overlap checks against VDJdb, IEDB, McPAS-TCR and the specific training splits of the models they test, this could become a useful reference point.

The soft spot is obvious from the abstract: no description of how the datasets were built, what similarity thresholds were used, what overlap statistics were computed, and no numbers showing the claimed performance drops. The stress-test note is on target—the load-bearing assumption is that these sets contain nothing the evaluated models have seen, and nothing in the provided text verifies that. Without those checks the observed drops could be artifacts rather than evidence of limited generalization.

This is for people actively building or benchmarking TCR prediction models in computational immunology. A reader who needs better test sets would find the framing useful once the methods section is available. The work shows clear engagement with the literature on benchmark leakage, so it meets the bar for serious refereeing even though the current draft is thin on evidence. I would send it out for review focused on the dataset construction protocol and the quantitative results.

Referee Report

2 major / 0 minor

Summary. The paper claims that existing TCR-epitope prediction models exhibit limited generalization, as evidenced by performance drops when evaluated on two new complementary classes of rigorously defined unseen benchmark datasets introduced by the authors; these datasets are positioned as a framework for unbiased model assessment and future algorithm development.

Significance. If the datasets are verifiably free of overlap with training corpora and the reported performance drops hold, the work would provide a valuable new evaluation standard in TCR specificity prediction, addressing a recognized gap in the field and potentially guiding more robust model development.

major comments (2)

[Dataset description / Methods] The manuscript provides no details on dataset construction protocols, including any overlap or homology checks against standard corpora (VDJdb, IEDB, McPAS-TCR) or model-specific training splits, nor on similarity thresholds used to ensure the sets are unseen; this is load-bearing for the central claim of true generalization failure rather than leakage or artifact.
[Results / Abstract] No quantitative results, specific models evaluated, performance metrics (e.g., AUC, sensitivity/specificity), or validation procedures are supplied, preventing assessment of whether the claimed drops support the generalization conclusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for improvement in our manuscript. We address each major comment below and will incorporate the necessary revisions.

read point-by-point responses

Referee: [Dataset description / Methods] The manuscript provides no details on dataset construction protocols, including any overlap or homology checks against standard corpora (VDJdb, IEDB, McPAS-TCR) or model-specific training splits, nor on similarity thresholds used to ensure the sets are unseen; this is load-bearing for the central claim of true generalization failure rather than leakage or artifact.

Authors: We agree with the referee that detailed information on dataset construction is crucial for validating the claims of limited generalization. The current manuscript's description is indeed high-level. In the revised manuscript, we will add a comprehensive Methods section detailing the protocols for creating the two classes of unseen benchmark datasets. This will include: (1) exact procedures for selecting TCR-epitope pairs, (2) overlap and homology checks against VDJdb, IEDB, and McPAS-TCR using specific tools and thresholds (e.g., sequence similarity <30% identity or e-value thresholds), (3) verification against model-specific training splits, and (4) justification of the similarity thresholds to ensure the sets are truly unseen. These additions will allow readers to confirm the absence of leakage or artifacts. revision: yes
Referee: [Results / Abstract] No quantitative results, specific models evaluated, performance metrics (e.g., AUC, sensitivity/specificity), or validation procedures are supplied, preventing assessment of whether the claimed drops support the generalization conclusion.

Authors: We acknowledge that the abstract does not contain quantitative results, and the manuscript would be strengthened by including them explicitly. While the full text discusses the performance drops, we will revise the abstract to summarize key quantitative findings, such as the specific models evaluated (e.g., several deep learning-based TCR-epitope predictors), the performance metrics used (AUC-ROC, sensitivity, specificity), and the observed drops on the new benchmarks compared to standard evaluations. Additionally, we will detail the validation procedures in the Results section to better support the generalization conclusion. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark dataset construction is an empirical claim, not a self-referential derivation

full rationale

The paper introduces two classes of benchmark datasets claimed to be rigorously unseen for evaluating TCR-epitope models. No equations, parameters, or derivations exist that could reduce to inputs by construction. The central assertion (unseen status enabling unbiased evaluation of generalization) is an empirical statement about data partitioning and overlap checks, not a tautological redefinition or fitted-input prediction. No self-citation load-bearing, uniqueness theorems, or ansatz smuggling appear in the provided text. This is a standard non-circular benchmarking contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are visible in the abstract. The central claim rests on the unshown construction details of the two dataset classes.

pith-pipeline@v0.9.1-grok · 5618 in / 1053 out tokens · 22192 ms · 2026-06-28T07:01:17.892752+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

10 extracted references

[1]

& Davis, M.M

Hedrick, S.M., Cohen, D.I., Nielsen, E.A. & Davis, M.M. Isolation of cDNA clones encoding T cell-specific membrane-associated proteins.Nature308, 149-153 (1984)

1984
[2]

Yanagi, Y. et al. A human T cell-specific cDNA clone encodes a protein having extensive homology to immunoglobulin chains.Nature308, 145-149 (1984)

1984
[3]

The major histocompatibility complex determines susceptibility to cytotoxic T cells directed against minor histocompatibility antigens.J Exp Med142, 1349-1364 (1975)

Bevan, M.J. The major histocompatibility complex determines susceptibility to cytotoxic T cells directed against minor histocompatibility antigens.J Exp Med142, 1349-1364 (1975)

1975
[4]

& Samelson, L.E

Gordon, R.D., Simpson, E. & Samelson, L.E. In vitro cell-mediated immune responses to the male specific(H-Y) antigen in mice.J Exp Med142, 1108-1120 (1975). 5

1975
[5]

& Marrack, P.C

Kappler, J.W. & Marrack, P.C. Helper T cells recognise antigen and macrophage surface com- ponents simultaneously.Nature262, 797-799 (1976)

1976
[6]

Bjorkman, P.J. et al. The foreign antigen binding site and T cell recognition regions of class I histocompatibility antigens.Nature329, 512-518 (1987)

1987
[7]

Garcia, K.C. et al. An alphabeta T cell receptor structure at 2.5 A and its orientation in the TCR-MHC complex.Science274, 209-219 (1996)

1996
[8]

Ma, K.-Y. et al. High-throughput and high-dimensional single-cell analysis of antigen-specific CD8+ T cells.Nature Immunology22, 1590-1598 (2021)

2021
[9]

Malone, M.J. et al. Resistance Potential of the HLA-A2-restricted Immunodominant SARS- CoV-2 Specific CD8+ T Cell Receptor Repertoire to Antigenic Drift.Nature Communications, Accepted
[10]

Assessment of computational methods in predicting TCR-epitope binding recognition

Nielsen, M. et al. Lessons learned from the IMMREP23 TCR-epitope prediction challenge. ImmunoInformatics16, 100045 (2024). Lu, Y., Y. Wang, M. Xu, B. Xie, Y. Yang, H. Xu and S. Suo (2026). "Assessment of computational methods in predicting TCR-epitope binding recognition."Nat Methods23(1): 248-259. Ma, K.-Y., A. A. Schonnesen, C. He, A. Y. Xia, E. Sun, E....

2024

[1] [1]

& Davis, M.M

Hedrick, S.M., Cohen, D.I., Nielsen, E.A. & Davis, M.M. Isolation of cDNA clones encoding T cell-specific membrane-associated proteins.Nature308, 149-153 (1984)

1984

[2] [2]

Yanagi, Y. et al. A human T cell-specific cDNA clone encodes a protein having extensive homology to immunoglobulin chains.Nature308, 145-149 (1984)

1984

[3] [3]

The major histocompatibility complex determines susceptibility to cytotoxic T cells directed against minor histocompatibility antigens.J Exp Med142, 1349-1364 (1975)

Bevan, M.J. The major histocompatibility complex determines susceptibility to cytotoxic T cells directed against minor histocompatibility antigens.J Exp Med142, 1349-1364 (1975)

1975

[4] [4]

& Samelson, L.E

Gordon, R.D., Simpson, E. & Samelson, L.E. In vitro cell-mediated immune responses to the male specific(H-Y) antigen in mice.J Exp Med142, 1108-1120 (1975). 5

1975

[5] [5]

& Marrack, P.C

Kappler, J.W. & Marrack, P.C. Helper T cells recognise antigen and macrophage surface com- ponents simultaneously.Nature262, 797-799 (1976)

1976

[6] [6]

Bjorkman, P.J. et al. The foreign antigen binding site and T cell recognition regions of class I histocompatibility antigens.Nature329, 512-518 (1987)

1987

[7] [7]

Garcia, K.C. et al. An alphabeta T cell receptor structure at 2.5 A and its orientation in the TCR-MHC complex.Science274, 209-219 (1996)

1996

[8] [8]

Ma, K.-Y. et al. High-throughput and high-dimensional single-cell analysis of antigen-specific CD8+ T cells.Nature Immunology22, 1590-1598 (2021)

2021

[9] [9]

Malone, M.J. et al. Resistance Potential of the HLA-A2-restricted Immunodominant SARS- CoV-2 Specific CD8+ T Cell Receptor Repertoire to Antigenic Drift.Nature Communications, Accepted

[10] [10]

Assessment of computational methods in predicting TCR-epitope binding recognition

Nielsen, M. et al. Lessons learned from the IMMREP23 TCR-epitope prediction challenge. ImmunoInformatics16, 100045 (2024). Lu, Y., Y. Wang, M. Xu, B. Xie, Y. Yang, H. Xu and S. Suo (2026). "Assessment of computational methods in predicting TCR-epitope binding recognition."Nat Methods23(1): 248-259. Ma, K.-Y., A. A. Schonnesen, C. He, A. Y. Xia, E. Sun, E....

2024