AEGIS: A Multi-Task Joint-Embedding Predictive Architecture for Mammography

Lakshman Tamil; Sai Karthik Navuluru; Scott Chase Waggener

arxiv: 2607.00277 · v1 · pith:AUJKIE5Gnew · submitted 2026-06-30 · 💻 cs.CV

AEGIS: A Multi-Task Joint-Embedding Predictive Architecture for Mammography

Scott Chase Waggener , Sai Karthik Navuluru , Lakshman Tamil This is my paper

Pith reviewed 2026-07-02 19:02 UTC · model grok-4.3

classification 💻 cs.CV

keywords mammographybreast cancer detectionself-supervised learningvision transformerbreast density assessmentjoint embedding predictive architecturemedical imaging triage

0 comments

The pith

A joint-embedding predictive model reaches 0.949 AUC for breast cancer triage after pre-training on 71,103 studies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Aegis, a multi-task architecture that pre-trains Vision Transformers with joint-embedding predictive methods on a large multi-site mammography corpus before fine-tuning for cancer detection and density assessment. On a held-out test set the largest variant delivers 0.949 AUC for triage at 93 percent sensitivity and 75 percent specificity, with further gains from ensembling against an existing cleared system. The same model also classifies breast density at 0.953 AUC for the binary task and 62.6 percent exact accuracy across four categories, plus 0.871 AUC in zero-shot external validation. These results matter to a reader because they show how self-supervised pre-training on unlabeled clinical data can support high-stakes screening tasks that usually demand large labeled sets.

Core claim

Aegis is a joint-embedding predictive architecture built on Vision Transformer variants that first learns representations via self-supervised JEPA pre-training across 71,103 studies from 14 sites, then fine-tunes with progressive resolution scaling to perform simultaneous breast cancer triage and BI-RADS density classification; the largest model records 0.949 AUC on triage, 0.953 AUC on binary density, 62.6 percent exact four-class density accuracy, and 0.871 AUC zero-shot on an external public dataset.

What carries the argument

Joint-embedding predictive architecture (JEPA) pre-training of Vision Transformer Small/Base/Large variants, followed by supervised multi-task fine-tuning for cancer detection and density assessment.

If this is right

The largest model alone can operate at 93 percent sensitivity with 75 percent specificity for cancer triage.
An ensemble with an FDA-cleared baseline raises discrimination to 0.952 AUC.
Density classification reaches 98.8 percent adjacent accuracy, matching reported human inter-reader levels.
Zero-shot transfer to VinDr-Mammo yields 0.871 AUC under a different reference standard.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pre-training recipe could be applied to other high-volume imaging modalities where labeled outcomes are expensive to obtain.
Progressive resolution scaling during fine-tuning may be reusable for any medical imaging task that benefits from both global context and fine local detail.
If the test-set curation process inadvertently favored easier cases, real-world sensitivity at the reported operating point would be lower than claimed.
Multi-task training that jointly optimizes cancer detection and density may reduce the total annotation burden compared with training separate models.

Load-bearing premise

The curated 785-study test set is representative of real-world clinical distributions and free of selection bias relative to the 71,103-study pre-training corpus.

What would settle it

A statistically significant drop below 0.90 AUC when the same model is evaluated on an unselected consecutive series of screening mammograms collected after the study period from a site not represented in the original 14.

Figures

Figures reproduced from arXiv: 2607.00277 by Lakshman Tamil, Sai Karthik Navuluru, Scott Chase Waggener.

**Figure 1.** Figure 1: JEPA pre-training architecture. A student network processes masked mammograms while a teacher network (updated via EMA) processes the unmasked [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗

**Figure 2.** Figure 2: Supervised fine-tuning architecture. The pre-trained ViT backbone processes mammograms and outputs a grid of visual tokens along with 4 CLS [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of data partitioning in the MedCognetics mammography database. The pretrain set is used for self-supervised learning and contains the [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Receiver operating characteristic (ROC) curves for breast cancer triage (Triage+Detection score; CNN: Triage only) on the development set. Operating [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: ROC curves for breast cancer triage (Triage+Detection score; CNN: Triage only) on the test set. Operating points indicate the threshold selected via [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

read the original abstract

We present Aegis, a joint-embedding predictive architecture for breast cancer detection and density assessment in mammography. We train three Vision Transformer variants (Small/Base/Large) using self-supervised joint-embedding predictive architecture (JEPA) pre-training on 71,103 studies from 14 clinical sites, followed by supervised fine-tuning with progressive resolution scaling up to 2048x1536. On a curated 785-study test set, our largest model achieves area under the receiver operating characteristic curve (AUC) 0.949 for breast cancer triage with 93% sensitivity and 75% specificity at the optimal operating point. An ensemble combining our model with a U.S. Food and Drug Administration-cleared baseline further improves discrimination to 0.952 AUC. For breast density classification, the model achieves 0.953 AUC for binary (dense vs. non-dense) classification and 62.6% exact accuracy across four Breast Imaging Reporting and Data System (BI-RADS) categories, with 98.8% adjacent accuracy comparable to reported human inter-reader agreement. External validation on the public VinDr-Mammo dataset provides evidence of cross-population transfer under a different reference standard, with the largest model achieving 0.871 AUC for triage in a zero-shot setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Applies existing JEPA pre-training to mammography at scale and reports usable AUC numbers, but the curated test set leaves generalizability unclear.

read the letter

The main point is that this paper takes the established JEPA self-supervised approach, pre-trains ViT models on 71k studies from 14 sites, and fine-tunes for cancer triage plus breast density classification. On their 785-study hold-out they reach 0.949 AUC for triage (93% sens at 75% spec) and 0.953 AUC for binary density, with an ensemble bump to 0.952 and 0.871 AUC zero-shot on VinDr-Mammo.

They do the multi-task setup and progressive resolution scaling cleanly, and the density adjacent accuracy of 98.8% matches typical human agreement. The multi-site pre-training and external validation are the parts that actually add some evidence of transfer.

The soft spot is exactly the one flagged in the stress test: the test set is called "curated" with no selection protocol, prevalence tables, or comparison to the pre-training distribution. Without that, the headline numbers rest on an unverified assumption that the hold-out matches real clinical streams. The abstract also gives no statistical tests or confidence intervals, and the only baseline is a single FDA-cleared system.

This is for groups already working on self-supervised medical imaging who want to see JEPA numbers in mammography. It is not a methods paper, so the central claims are empirical rather than theoretical. The work shows clear thinking on the application side and honest reporting of the external validation.

I would send it for peer review so the data curation and evaluation details can be checked; the scale and transfer results are worth referee time even if revisions are needed on the test-set description.

Referee Report

2 major / 3 minor

Summary. The manuscript presents AEGIS, a multi-task joint-embedding predictive architecture (JEPA) using Vision Transformer variants (Small/Base/Large) for breast cancer triage and BI-RADS density assessment in mammography. It reports self-supervised pre-training on 71,103 studies from 14 sites, followed by supervised fine-tuning with progressive resolution scaling. On a curated 785-study test set the largest model reaches AUC 0.949 for triage (93% sensitivity, 75% specificity at optimal point), an ensemble with an FDA-cleared system reaches 0.952 AUC, binary density AUC 0.953, 62.6% exact 4-class accuracy (98.8% adjacent), and zero-shot AUC 0.871 on VinDr-Mammo.

Significance. If the test-set results generalize, the work provides evidence that large-scale multi-site JEPA pre-training can support competitive multi-task performance in mammography, including cross-population transfer on an external public dataset. The scale of pre-training data and the inclusion of external validation are positive features. The significance is limited by the absence of any characterization of the test-set curation process or statistical analysis of the reported metrics.

major comments (2)

[Abstract] Abstract: the triage performance (AUC 0.949, 93%/75% operating point) and ensemble improvement (to 0.952) are reported exclusively on a 'curated 785-study test set' with no description of selection criteria, stratification by site/prevalence/density/acquisition parameters, or any demographic/prevalence tables comparing the test set to the 71,103-study pre-training corpus. Because the central clinical-utility claim rests on this held-out evaluation, the lack of curation protocol makes it impossible to determine whether the result is representative or affected by selection bias.
[Abstract] Abstract / Results: no confidence intervals, statistical tests, or p-values are supplied for any AUC value, sensitivity/specificity point, or comparison against the FDA-cleared baseline. This information is required to support claims of improvement and clinical relevance.

minor comments (3)

[Abstract] Abstract: the construction of the ensemble (score averaging, weighting, etc.) is not described.
[Abstract] Abstract: no information is given on how class imbalance was handled during fine-tuning or evaluation.
[Abstract] Abstract: the exact number of parameters and training hyperparameters for the three ViT variants are not stated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the importance of test-set transparency and statistical rigor. We address each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the triage performance (AUC 0.949, 93%/75% operating point) and ensemble improvement (to 0.952) are reported exclusively on a 'curated 785-study test set' with no description of selection criteria, stratification by site/prevalence/density/acquisition parameters, or any demographic/prevalence tables comparing the test set to the 71,103-study pre-training corpus. Because the central clinical-utility claim rests on this held-out evaluation, the lack of curation protocol makes it impossible to determine whether the result is representative or affected by selection bias.

Authors: We agree that the current manuscript lacks sufficient detail on test-set curation, which limits assessment of representativeness and potential bias. In the revised version we will add a dedicated Methods subsection describing the curation protocol for the 785-study test set, including explicit selection criteria, any stratification by site/prevalence/density/acquisition parameters, and comparative tables or statistics on demographics and prevalence versus the 71,103-study pre-training corpus. revision: yes
Referee: [Abstract] Abstract / Results: no confidence intervals, statistical tests, or p-values are supplied for any AUC value, sensitivity/specificity point, or comparison against the FDA-cleared baseline. This information is required to support claims of improvement and clinical relevance.

Authors: We acknowledge the absence of confidence intervals and formal statistical testing. In the revision we will report 95% bootstrap confidence intervals for all AUC, sensitivity, and specificity values and will include statistical comparisons (e.g., DeLong test) with p-values for the ensemble versus the FDA-cleared baseline to substantiate claims of improvement. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on held-out data with no derivations or self-referential fits

full rationale

The paper reports standard self-supervised pre-training (JEPA on 71,103 studies) followed by supervised fine-tuning and evaluation on a separate 785-study curated test set plus external validation on VinDr-Mammo. No equations, derivations, fitted parameters re-labeled as predictions, or load-bearing self-citations appear in the provided text. Performance metrics (AUC 0.949, etc.) are direct empirical measurements on held-out data rather than quantities forced by construction from the inputs. This matches the default expectation for non-circular empirical ML papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no model equations, hyperparameter tables, or explicit assumptions are provided, so the ledger remains empty pending full text.

pith-pipeline@v0.9.1-grok · 5768 in / 1246 out tokens · 33873 ms · 2026-07-02T19:02:11.838279+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

45 extracted references · 10 canonical work pages · 6 internal anchors

[1]

Global cancer statistics 2022: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries,

F. Bray, M. Laversanne, H. Sung, J. Ferlay, R. L. Siegel, I. Soer- jomataram, and A. Jemal, “Global cancer statistics 2022: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries,”CA: A Cancer Journal for Clinicians, vol. 74, no. 3, pp. 229–263, 2024

2022
[2]

Breast cancer,

W. H. Organization, “Breast cancer,” Fact sheet, April 2026, available from: https://www.who.int/news-room/fact-sheets/detail/breast-cancer

2026
[3]

Global patterns and trends in breast cancer incidence and mortality across 185 countries,

J. Kim, A. Harper, V . McCormack, H. Sung, N. Houssami, E. Morgan, M. M. Fidler-Benaoudia, I. Soerjomataram, and F. Bray, “Global patterns and trends in breast cancer incidence and mortality across 185 countries,”Nature Medicine, February 2025. [Online]. Available: https://www.nature.com/articles/s41591-025-03502-3

2025
[4]

Self-supervised learning from images with a joint-embedding predictive architecture,

M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y . LeCun, and N. Ballas, “Self-supervised learning from images with a joint-embedding predictive architecture,” 2023

2023
[5]

A simple framework for contrastive learning of visual representations,

T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” 2020

2020
[6]

Momentum contrast for unsupervised visual representation learning,

K. He, H. Fan, Y . Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” 2019

2019
[7]

Bootstrap your own latent: A new approach to self-supervised learning,

J.-B. Grill, F. Strub, F. Altch ´e, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. Guo, M. G. Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko, “Bootstrap your own latent: A new approach to self-supervised learning,” 2020

2020
[8]

Masked autoencoders are scalable vision learners,

K. He, X. Chen, S. Xie, Y . Li, P. Doll ´ar, and R. Girshick, “Masked autoencoders are scalable vision learners,” 2021

2021
[9]

Beit: Bert pre-training of image transformers,

H. Bao, L. Dong, S. Piao, and F. Wei, “Beit: Bert pre-training of image transformers,” 2021

2021
[10]

ibot: Image bert pre-training with online tokenizer,

J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong, “ibot: Image bert pre-training with online tokenizer,” 2021

2021
[11]

Emerging properties in self-supervised vision transformers,

M. Caron, H. Touvron, I. Misra, H. J ´egou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” 2021

2021
[12]

Dinov2: Learning robust visual features without supervision,

M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y . Huang, S.-W. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features without supe...

2023
[13]

Revisiting feature prediction for learning visual representations from video,

A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y . LeCun, M. Assran, and N. Ballas, “Revisiting feature prediction for learning visual representations from video,” 2024

2024
[14]

LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

R. Balestriero and Y . LeCun, “Lejepa: Provable and scalable self- supervised learning without the heuristics,” 2025. [Online]. Available: https://arxiv.org/abs/2511.08544

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” 2021. [Online]. Available: https://arxiv.org/abs/2010.11929

work page internal anchor Pith review Pith/arXiv arXiv 2021
[16]

Training data-efficient image transformers & distillation through attention,

H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. J ´egou, “Training data-efficient image transformers & distillation through attention,” 2021. [Online]. Available: https://arxiv.org/abs/2012. 12877

2021
[17]

Escaping the big data paradigm with compact transformers,

A. Hassani, S. Walton, N. Shah, A. Abuduweili, J. Li, and H. Shi, “Escaping the big data paradigm with compact transformers,” 2022. [Online]. Available: https://arxiv.org/abs/2104.05704

work page arXiv 2022
[18]

Transformers in medical imaging: A survey,

F. Shamshad, S. Khan, S. W. Zamir, M. H. Khan, M. Hayat, F. S. Khan, and H. Fu, “Transformers in medical imaging: A survey,”Medical Image Analysis, vol. 88, p. 102802, 2023

2023
[19]

Vision transformers need registers,

T. Darcet, M. Oquab, J. Mairal, and P. Bojanowski, “Vision transformers need registers,” 2024. [Online]. Available: https://arxiv.org/abs/2309. 16588

2024
[20]

DINOv3

O. Sim ´eoni, H. V . V o, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V . Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. J ´egou, P. Labatut, and P. Bojanowski, “DINOv3,” 2025. [Online]. Available: h...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Root mean square layer normalization,

B. Zhang and R. Sennrich, “Root mean square layer normalization,”
[22]

Root Mean Square Layer Normalization

[Online]. Available: https://arxiv.org/abs/1910.07467

work page internal anchor Pith review Pith/arXiv arXiv 1910
[23]

Vision-transformer-based transfer learning for mammogram classification,

G. Ayana, K. Dese, and S.-w. Choe, “Vision-transformer-based transfer learning for mammogram classification,”Diagnostics, vol. 13, no. 2, p. 178, 2023

2023
[24]

Detection of breast cancer in digital breast tomosynthesis with vision transformers,

I. Kassis, D. Lederman, G. Ben-Arie, M. Giladi Rosenthal, I. Shelef, and Y . Zigel, “Detection of breast cancer in digital breast tomosynthesis with vision transformers,”Scientific Reports, vol. 14, p. 22149, 2024

2024
[25]

Kashiwada, E

Y . Kashiwada, E. Takaya, M. Hiroya, N. Matsuda, T. Yashima, T. Kobayashi, G. Tamiya, and T. Ueda, “Applying vision transformer to assess multi-scale morphological features in mammography for breast cancer detection: multiscale image morphological extraction vision transformer (MIME-ViT),”PeerJ Computer Science, vol. 11, p. e3252, 2025

2025
[26]

International evaluation of an AI system for breast cancer screening,

S. M. McKinney, M. Sieniek, V . Godbole, J. Godwin, N. Antropova, H. Ashrafian, T. Back, M. Chesus, G. S. Corrado, A. Darziet al., “International evaluation of an AI system for breast cancer screening,” Nature, vol. 577, no. 7788, pp. 89–94, 2020

2020
[27]

Nation- wide real-world implementation of AI for cancer detection in population- based mammography screening,

N. Eisemann, S. Bunk, T. Mukama, H. Baltus, S. A. Elsner, T. Gomille, G. Hecht, S. Heywang-K”obrunner, R. Rathmann, K. Siegmann-Luz, T. T”ollner, T. Werner V omweg, C. Leibig, and A. Katalinic, “Nation- wide real-world implementation of AI for cancer detection in population- based mammography screening,”Nature Medicine, vol. 31, pp. 917–924, 2025

2025
[28]

A review of the role of augmented intelligence in breast imaging: From automated breast density assessment to risk stratification,

A. Arieno, A. Chan, and S. V . Destounis, “A review of the role of augmented intelligence in breast imaging: From automated breast density assessment to risk stratification,”American Journal of Roentgenology, vol. 212, no. 2, pp. 259–270, 2019, pMID: 30422711. [Online]. Available: https://doi.org/10.2214/AJR.18.20391

work page doi:10.2214/ajr.18.20391 2019
[29]

Variation in mammographic breast density assessments among radiologists in clinical practice: a multicenter observational study,

B. L. Sprague, E. F. Conant, T. Onega, M. P. Garcia, E. F. Beaber, S. D. Herschorn, C. D. Lehman, A. N. A. Tosteson, R. Lacson, M. D. Schnall, D. Kontos, J. S. Haas, D. L. Weaver, W. E. Barlow, and PROSPR Consortium, “Variation in mammographic breast density assessments among radiologists in clinical practice: a multicenter observational study,”Annals of ...

2016
[30]

Breast cancer risk assessment in the ai era: The importance of model validation in ethnically diverse cohorts,

D. Kontos and J. Kalpathy-Cramer, “Breast cancer risk assessment in the ai era: The importance of model validation in ethnically diverse cohorts,”Radiology: Artificial Intelligence, vol. 5, no. 6, p. e230462,
[31]

Available: https://doi.org/10.1148/ryai.230462

[Online]. Available: https://doi.org/10.1148/ryai.230462

work page doi:10.1148/ryai.230462
[32]

Vindr-mammo: A large-scale benchmark dataset for computer- aided diagnosis in full-field digital mammography,

H. T. Nguyen, H. Q. Nguyen, H. H. Pham, K. Lam, L. T. Le, M. Dao, and V . Vu, “Vindr-mammo: A large-scale benchmark dataset for computer- aided diagnosis in full-field digital mammography,”Scientific Data, vol. 10, p. 277, 2023

2023
[33]

Roformer: Enhanced transformer with rotary position embedding,

J. Su, M. Ahmed, Y . Lu, S. Pan, W. Bo, and Y . Liu, “Roformer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0925231223011864

2024
[34]

GLU Variants Improve Transformer

N. Shazeer, “Glu variants improve transformer,” 2020. [Online]. Available: https://arxiv.org/abs/2002.05202

work page internal anchor Pith review Pith/arXiv arXiv 2020
[35]

Cluster and predict latent patches for improved masked image mod- eling,

T. Darcet, F. Baldassarre, M. Oquab, J. Mairal, and P. Bojanowski, “Cluster and predict latent patches for improved masked image mod- eling,” 2025

2025
[36]

Objects as Points

X. Zhou, D. Wang, and P. Kr ¨ahenb¨uhl, “Objects as points,” inarXiv preprint arXiv:1904.07850, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904
[37]

Index for rating diagnostic tests,

W. J. Youden, “Index for rating diagnostic tests,”Cancer, vol. 3, no. 1, pp. 32–35, 1950

1950
[38]

ROC curves in clinical chemistry: uses, misuses, and possible solutions,

N. A. Obuchowski, M. L. Lieber, and F. H. Wians, “ROC curves in clinical chemistry: uses, misuses, and possible solutions,”Clinical Chemistry, vol. 50, no. 7, pp. 1118–1125, 2004

2004
[39]

United states cancer statistics: Data visualizations,

U.S. Centers for Disease Control and Prevention, “United states cancer statistics: Data visualizations,” https://gis.cdc.gov/Cancer/USCS/, 2023, accessed: August 2023

2023
[40]

Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach,

E. R. DeLong, D. M. DeLong, and D. L. Clarke-Pearson, “Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach,”Biometrics, vol. 44, no. 3, pp. 837– 845, 1988

1988
[41]

Optimizing breast cancer detection in mammograms: A comprehensive study of transfer learning, resolution reduction, and multi-view classification,

D. G. P. Petrini and H. Y . Kim, “Optimizing breast cancer detection in mammograms: A comprehensive study of transfer learning, resolution reduction, and multi-view classification,” 2025

2025
[42]

MamT4: Multi-view attention networks for mammog- raphy cancer classification,

A. Ibragimov, S. Senotrusova, A. Litvinov, E. Ushakov, E. Karpulevich, and Y . Markin, “MamT4: Multi-view attention networks for mammog- raphy cancer classification,” in2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC). IEEE, 2024, pp. 1965–1970

2024
[43]

Dual view deep learning for en- hanced breast cancer screening using mammography,

S. R. Kebede, F. G. Waldamichael, T. G. Debelee, M. Aleme, W. Bedane, B. Mezgebu, and Z. C. Merga, “Dual view deep learning for en- hanced breast cancer screening using mammography,”Scientific Reports, vol. 14, p. 3839, 2024

2024
[44]

AUCReshaping: improved sensitivity at high-specificity,

S. Bhat, A. Mansoor, B. Georgescuet al., “AUCReshaping: improved sensitivity at high-specificity,”Scientific Reports, vol. 13, p. 21097, 2023

2023
[45]

T-synth: A knowledge-based dataset of synthetic breast images,

C. Wiedeman, A. Sarmakeeva, E. Sizikova, D. Filienko, M. Lago, J. G. Delfino, and A. Badano, “T-synth: A knowledge-based dataset of synthetic breast images,”arXiv preprint arXiv:2507.04038, 2025

work page arXiv 2025

[1] [1]

Global cancer statistics 2022: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries,

F. Bray, M. Laversanne, H. Sung, J. Ferlay, R. L. Siegel, I. Soer- jomataram, and A. Jemal, “Global cancer statistics 2022: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries,”CA: A Cancer Journal for Clinicians, vol. 74, no. 3, pp. 229–263, 2024

2022

[2] [2]

Breast cancer,

W. H. Organization, “Breast cancer,” Fact sheet, April 2026, available from: https://www.who.int/news-room/fact-sheets/detail/breast-cancer

2026

[3] [3]

Global patterns and trends in breast cancer incidence and mortality across 185 countries,

J. Kim, A. Harper, V . McCormack, H. Sung, N. Houssami, E. Morgan, M. M. Fidler-Benaoudia, I. Soerjomataram, and F. Bray, “Global patterns and trends in breast cancer incidence and mortality across 185 countries,”Nature Medicine, February 2025. [Online]. Available: https://www.nature.com/articles/s41591-025-03502-3

2025

[4] [4]

Self-supervised learning from images with a joint-embedding predictive architecture,

M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y . LeCun, and N. Ballas, “Self-supervised learning from images with a joint-embedding predictive architecture,” 2023

2023

[5] [5]

A simple framework for contrastive learning of visual representations,

T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” 2020

2020

[6] [6]

Momentum contrast for unsupervised visual representation learning,

K. He, H. Fan, Y . Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” 2019

2019

[7] [7]

Bootstrap your own latent: A new approach to self-supervised learning,

J.-B. Grill, F. Strub, F. Altch ´e, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. Guo, M. G. Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko, “Bootstrap your own latent: A new approach to self-supervised learning,” 2020

2020

[8] [8]

Masked autoencoders are scalable vision learners,

K. He, X. Chen, S. Xie, Y . Li, P. Doll ´ar, and R. Girshick, “Masked autoencoders are scalable vision learners,” 2021

2021

[9] [9]

Beit: Bert pre-training of image transformers,

H. Bao, L. Dong, S. Piao, and F. Wei, “Beit: Bert pre-training of image transformers,” 2021

2021

[10] [10]

ibot: Image bert pre-training with online tokenizer,

J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong, “ibot: Image bert pre-training with online tokenizer,” 2021

2021

[11] [11]

Emerging properties in self-supervised vision transformers,

M. Caron, H. Touvron, I. Misra, H. J ´egou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” 2021

2021

[12] [12]

Dinov2: Learning robust visual features without supervision,

M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y . Huang, S.-W. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features without supe...

2023

[13] [13]

Revisiting feature prediction for learning visual representations from video,

A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y . LeCun, M. Assran, and N. Ballas, “Revisiting feature prediction for learning visual representations from video,” 2024

2024

[14] [14]

LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

R. Balestriero and Y . LeCun, “Lejepa: Provable and scalable self- supervised learning without the heuristics,” 2025. [Online]. Available: https://arxiv.org/abs/2511.08544

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” 2021. [Online]. Available: https://arxiv.org/abs/2010.11929

work page internal anchor Pith review Pith/arXiv arXiv 2021

[16] [16]

Training data-efficient image transformers & distillation through attention,

H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. J ´egou, “Training data-efficient image transformers & distillation through attention,” 2021. [Online]. Available: https://arxiv.org/abs/2012. 12877

2021

[17] [17]

Escaping the big data paradigm with compact transformers,

A. Hassani, S. Walton, N. Shah, A. Abuduweili, J. Li, and H. Shi, “Escaping the big data paradigm with compact transformers,” 2022. [Online]. Available: https://arxiv.org/abs/2104.05704

work page arXiv 2022

[18] [18]

Transformers in medical imaging: A survey,

F. Shamshad, S. Khan, S. W. Zamir, M. H. Khan, M. Hayat, F. S. Khan, and H. Fu, “Transformers in medical imaging: A survey,”Medical Image Analysis, vol. 88, p. 102802, 2023

2023

[19] [19]

Vision transformers need registers,

T. Darcet, M. Oquab, J. Mairal, and P. Bojanowski, “Vision transformers need registers,” 2024. [Online]. Available: https://arxiv.org/abs/2309. 16588

2024

[20] [20]

DINOv3

O. Sim ´eoni, H. V . V o, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V . Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. J ´egou, P. Labatut, and P. Bojanowski, “DINOv3,” 2025. [Online]. Available: h...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

Root mean square layer normalization,

B. Zhang and R. Sennrich, “Root mean square layer normalization,”

[22] [22]

Root Mean Square Layer Normalization

[Online]. Available: https://arxiv.org/abs/1910.07467

work page internal anchor Pith review Pith/arXiv arXiv 1910

[23] [23]

Vision-transformer-based transfer learning for mammogram classification,

G. Ayana, K. Dese, and S.-w. Choe, “Vision-transformer-based transfer learning for mammogram classification,”Diagnostics, vol. 13, no. 2, p. 178, 2023

2023

[24] [24]

Detection of breast cancer in digital breast tomosynthesis with vision transformers,

I. Kassis, D. Lederman, G. Ben-Arie, M. Giladi Rosenthal, I. Shelef, and Y . Zigel, “Detection of breast cancer in digital breast tomosynthesis with vision transformers,”Scientific Reports, vol. 14, p. 22149, 2024

2024

[25] [25]

Kashiwada, E

Y . Kashiwada, E. Takaya, M. Hiroya, N. Matsuda, T. Yashima, T. Kobayashi, G. Tamiya, and T. Ueda, “Applying vision transformer to assess multi-scale morphological features in mammography for breast cancer detection: multiscale image morphological extraction vision transformer (MIME-ViT),”PeerJ Computer Science, vol. 11, p. e3252, 2025

2025

[26] [26]

International evaluation of an AI system for breast cancer screening,

S. M. McKinney, M. Sieniek, V . Godbole, J. Godwin, N. Antropova, H. Ashrafian, T. Back, M. Chesus, G. S. Corrado, A. Darziet al., “International evaluation of an AI system for breast cancer screening,” Nature, vol. 577, no. 7788, pp. 89–94, 2020

2020

[27] [27]

Nation- wide real-world implementation of AI for cancer detection in population- based mammography screening,

N. Eisemann, S. Bunk, T. Mukama, H. Baltus, S. A. Elsner, T. Gomille, G. Hecht, S. Heywang-K”obrunner, R. Rathmann, K. Siegmann-Luz, T. T”ollner, T. Werner V omweg, C. Leibig, and A. Katalinic, “Nation- wide real-world implementation of AI for cancer detection in population- based mammography screening,”Nature Medicine, vol. 31, pp. 917–924, 2025

2025

[28] [28]

A review of the role of augmented intelligence in breast imaging: From automated breast density assessment to risk stratification,

A. Arieno, A. Chan, and S. V . Destounis, “A review of the role of augmented intelligence in breast imaging: From automated breast density assessment to risk stratification,”American Journal of Roentgenology, vol. 212, no. 2, pp. 259–270, 2019, pMID: 30422711. [Online]. Available: https://doi.org/10.2214/AJR.18.20391

work page doi:10.2214/ajr.18.20391 2019

[29] [29]

Variation in mammographic breast density assessments among radiologists in clinical practice: a multicenter observational study,

B. L. Sprague, E. F. Conant, T. Onega, M. P. Garcia, E. F. Beaber, S. D. Herschorn, C. D. Lehman, A. N. A. Tosteson, R. Lacson, M. D. Schnall, D. Kontos, J. S. Haas, D. L. Weaver, W. E. Barlow, and PROSPR Consortium, “Variation in mammographic breast density assessments among radiologists in clinical practice: a multicenter observational study,”Annals of ...

2016

[30] [30]

Breast cancer risk assessment in the ai era: The importance of model validation in ethnically diverse cohorts,

D. Kontos and J. Kalpathy-Cramer, “Breast cancer risk assessment in the ai era: The importance of model validation in ethnically diverse cohorts,”Radiology: Artificial Intelligence, vol. 5, no. 6, p. e230462,

[31] [31]

Available: https://doi.org/10.1148/ryai.230462

[Online]. Available: https://doi.org/10.1148/ryai.230462

work page doi:10.1148/ryai.230462

[32] [32]

Vindr-mammo: A large-scale benchmark dataset for computer- aided diagnosis in full-field digital mammography,

H. T. Nguyen, H. Q. Nguyen, H. H. Pham, K. Lam, L. T. Le, M. Dao, and V . Vu, “Vindr-mammo: A large-scale benchmark dataset for computer- aided diagnosis in full-field digital mammography,”Scientific Data, vol. 10, p. 277, 2023

2023

[33] [33]

Roformer: Enhanced transformer with rotary position embedding,

J. Su, M. Ahmed, Y . Lu, S. Pan, W. Bo, and Y . Liu, “Roformer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0925231223011864

2024

[34] [34]

GLU Variants Improve Transformer

N. Shazeer, “Glu variants improve transformer,” 2020. [Online]. Available: https://arxiv.org/abs/2002.05202

work page internal anchor Pith review Pith/arXiv arXiv 2020

[35] [35]

Cluster and predict latent patches for improved masked image mod- eling,

T. Darcet, F. Baldassarre, M. Oquab, J. Mairal, and P. Bojanowski, “Cluster and predict latent patches for improved masked image mod- eling,” 2025

2025

[36] [36]

Objects as Points

X. Zhou, D. Wang, and P. Kr ¨ahenb¨uhl, “Objects as points,” inarXiv preprint arXiv:1904.07850, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904

[37] [37]

Index for rating diagnostic tests,

W. J. Youden, “Index for rating diagnostic tests,”Cancer, vol. 3, no. 1, pp. 32–35, 1950

1950

[38] [38]

ROC curves in clinical chemistry: uses, misuses, and possible solutions,

N. A. Obuchowski, M. L. Lieber, and F. H. Wians, “ROC curves in clinical chemistry: uses, misuses, and possible solutions,”Clinical Chemistry, vol. 50, no. 7, pp. 1118–1125, 2004

2004

[39] [39]

United states cancer statistics: Data visualizations,

U.S. Centers for Disease Control and Prevention, “United states cancer statistics: Data visualizations,” https://gis.cdc.gov/Cancer/USCS/, 2023, accessed: August 2023

2023

[40] [40]

Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach,

E. R. DeLong, D. M. DeLong, and D. L. Clarke-Pearson, “Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach,”Biometrics, vol. 44, no. 3, pp. 837– 845, 1988

1988

[41] [41]

Optimizing breast cancer detection in mammograms: A comprehensive study of transfer learning, resolution reduction, and multi-view classification,

D. G. P. Petrini and H. Y . Kim, “Optimizing breast cancer detection in mammograms: A comprehensive study of transfer learning, resolution reduction, and multi-view classification,” 2025

2025

[42] [42]

MamT4: Multi-view attention networks for mammog- raphy cancer classification,

A. Ibragimov, S. Senotrusova, A. Litvinov, E. Ushakov, E. Karpulevich, and Y . Markin, “MamT4: Multi-view attention networks for mammog- raphy cancer classification,” in2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC). IEEE, 2024, pp. 1965–1970

2024

[43] [43]

Dual view deep learning for en- hanced breast cancer screening using mammography,

S. R. Kebede, F. G. Waldamichael, T. G. Debelee, M. Aleme, W. Bedane, B. Mezgebu, and Z. C. Merga, “Dual view deep learning for en- hanced breast cancer screening using mammography,”Scientific Reports, vol. 14, p. 3839, 2024

2024

[44] [44]

AUCReshaping: improved sensitivity at high-specificity,

S. Bhat, A. Mansoor, B. Georgescuet al., “AUCReshaping: improved sensitivity at high-specificity,”Scientific Reports, vol. 13, p. 21097, 2023

2023

[45] [45]

T-synth: A knowledge-based dataset of synthetic breast images,

C. Wiedeman, A. Sarmakeeva, E. Sizikova, D. Filienko, M. Lago, J. G. Delfino, and A. Badano, “T-synth: A knowledge-based dataset of synthetic breast images,”arXiv preprint arXiv:2507.04038, 2025

work page arXiv 2025