Predicting Blastocyst Formation in IVF: Integrating DINOv2 and Attention-Based LSTM on Time-Lapse Embryo Images
Pith reviewed 2026-05-10 15:58 UTC · model grok-4.3
The pith
A hybrid DINOv2 and attention LSTM model predicts which embryos will form blastocysts from limited daily images at 96.4 percent accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that DINOv2 extracts useful spatial features from embryo images and an LSTM equipped with multi-head attention then models their temporal progression to predict blastocyst formation, reaching 96.4 percent accuracy on a dataset of 704 videos while remaining robust to missing frames.
What carries the argument
The hybrid pipeline in which DINOv2 supplies per-image feature vectors that are then processed by a multi-head attention LSTM to capture developmental dynamics over time.
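To make the temporal half of this pipeline concrete, the sketch below pools a sequence of per-frame feature vectors into a single vector with softmax attention. It is an illustration under stated assumptions, not the authors' architecture: it uses a single attention head and toy vectors instead of DINOv2 features and an LSTM, the `query` argument stands in for a learned parameter, and excluding missing frames from the softmax is a hypothetical masking scheme that the source does not describe.

```python
import math
from typing import List, Optional, Sequence


def masked_attention_pool(
    frames: Sequence[Optional[Sequence[float]]],
    query: Sequence[float],
) -> List[float]:
    """Pool per-frame feature vectors into one vector with softmax attention.

    `frames` holds one feature vector per time step. None marks a missing
    frame, which is left out of the softmax (an assumed masking scheme).
    """
    dim = len(query)
    scale = math.sqrt(dim)
    # Scaled dot-product score for every frame that is present.
    scores = {
        t: sum(q * x for q, x in zip(query, vec)) / scale
        for t, vec in enumerate(frames)
        if vec is not None
    }
    if not scores:
        raise ValueError("at least one frame must be present")
    # Numerically stable softmax over the present frames only.
    top = max(scores.values())
    exp_scores = {t: math.exp(s - top) for t, s in scores.items()}
    total = sum(exp_scores.values())
    weights = {t: e / total for t, e in exp_scores.items()}
    # Weighted sum of the present feature vectors.
    pooled = [0.0] * dim
    for t, w in weights.items():
        for i, x in enumerate(frames[t]):
            pooled[i] += w * x
    return pooled


# Toy usage: three daily frames, the middle one missing.
daily = [[1.0, 0.0], None, [0.0, 1.0]]
print(masked_attention_pool(daily, query=[0.0, 0.0]))
```

Because the missing frame never enters the softmax, the remaining weights still sum to one, which is one simple way a temporal model can tolerate gaps in the sequence.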
Load-bearing premise
The 704 embryo videos used for training and testing represent the range of imaging conditions and patient demographics encountered in other IVF laboratories.
What would settle it
Accuracy falling below 85 percent when the trained model is applied to embryo images collected at a different clinic with different time-lapse cameras or patient populations.
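A point estimate alone cannot show that accuracy has fallen below a threshold on a finite external test set, so the check above is easier to apply with an interval around the measured accuracy. The sketch below uses a Wilson score interval, which is a choice made here for illustration and is not something the paper or this review specifies. The 85 percent threshold comes from the text above; the counts in the usage lines are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def falls_below(correct: int, total: int, threshold: float = 0.85) -> bool:
    """True only when the whole interval lies under the threshold."""
    _, upper = wilson_interval(correct, total)
    return upper < threshold


# Hypothetical external-clinic results: 70 or 84 correct out of 100 embryos.
print(falls_below(70, 100))
print(falls_below(84, 100))
```

With this rule, an external accuracy slightly under 85 percent on a small sample does not count as a failure until the interval's upper bound also drops below the threshold.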
Original abstract
The selection of the optimal embryo for transfer is a critical yet challenging step in in vitro fertilization (IVF), primarily due to its reliance on the manual inspection of extensive time-lapse imaging data. A key obstacle in this process is predicting blastocyst formation from the limited number of daily images available. Many clinics also lack complete time-lapse systems, so full videos are often unavailable. In this study, we aimed to predict which embryos will develop into blastocysts using limited daily images from time-lapse recordings. We propose a novel hybrid model that combines DINOv2, a transformer-based vision model, with an enhanced long short-term memory (LSTM) network featuring a multi-head attention layer. DINOv2 extracts meaningful features from embryo images, and the LSTM model then uses these features to analyze embryo development over time and generate final predictions. We tested our model on a real dataset of 704 embryo videos. The model achieved 96.4% accuracy, surpassing existing methods. It also performs well with missing frames, making it valuable for many IVF laboratories with limited imaging systems. Our approach can assist embryologists in selecting better embryos more efficiently and with greater confidence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a hybrid model that uses DINOv2 to extract features from time-lapse embryo images and feeds them into an attention-augmented LSTM to predict blastocyst formation. It evaluates the approach on a dataset of 704 embryo videos, reports 96.4% accuracy (surpassing prior methods), and claims robustness when frames are missing.
Significance. If the accuracy claim survives proper patient-level cross-validation and external testing, the work would offer a practical aid for embryo selection in IVF clinics that lack complete time-lapse systems. The choice of a pre-trained vision transformer plus temporal attention is a reasonable modern adaptation, and explicit handling of incomplete sequences addresses a genuine clinical constraint.
major comments (2)
- [Results] Results section: the headline 96.4% accuracy on 704 videos is presented without any information on train-test split ratios, patient- or embryo-level stratification, k-fold cross-validation, class balance, or statistical testing. In time-series embryo data, failure to isolate images from the same IVF cycle across splits risks leakage and renders the performance claim uninterpretable.
- [Methods] Methods section: no description is given of how the 704 videos were acquired (number of patients, embryos per patient, imaging protocol, or exact daily sampling), nor of the baseline methods, their hyper-parameters, or the statistical tests used to assert superiority. These omissions make it impossible to assess whether the reported gains are reproducible or clinically meaningful.
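The leakage concern in the first major comment can be illustrated with a group-aware split, in which every video from one patient (or IVF cycle) lands entirely in either the training or the test portion of a fold. The sketch below is a minimal stdlib version with hypothetical patient identifiers; it balances fold sizes greedily and does not stratify by class, and it is not the evaluation protocol of the paper, which the review notes is unreported.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def grouped_kfold(
    groups: Sequence[Hashable], k: int
) -> List[Tuple[List[int], List[int]]]:
    """Split sample indices into k folds so no group spans train and test."""
    by_group: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)
    if k < 2 or k > len(by_group):
        raise ValueError("k must be between 2 and the number of groups")
    # Greedy balancing: place the largest groups first into the smallest fold.
    folds: List[List[int]] = [[] for _ in range(k)]
    ordered = sorted(by_group, key=lambda name: (-len(by_group[name]), str(name)))
    for group in ordered:
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_group[group])
    splits = []
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for other in range(k) if other != f for i in folds[other])
        splits.append((train, test))
    return splits


# Hypothetical patient IDs, one entry per embryo video.
patients = ["p1", "p1", "p2", "p3", "p3", "p3", "p4"]
for train_idx, test_idx in grouped_kfold(patients, k=2):
    train_patients = {patients[i] for i in train_idx}
    test_patients = {patients[i] for i in test_idx}
    print(sorted(train_patients), sorted(test_patients))
```

In practice a library routine such as scikit-learn's `GroupKFold` serves the same purpose; the point of the sketch is only that the split is keyed on the patient, not on the individual video or frame.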
minor comments (1)
- [Abstract] The abstract would benefit from a single sentence on validation strategy to allow readers to gauge the 96.4% figure immediately.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important omissions in our description of the experimental protocol. We agree that these details are necessary for assessing the validity of our results and will revise the manuscript accordingly to enhance transparency and reproducibility.
- Referee: [Results] Results section: the headline 96.4% accuracy on 704 videos is presented without any information on train-test split ratios, patient- or embryo-level stratification, k-fold cross-validation, class balance, or statistical testing. In time-series embryo data, failure to isolate images from the same IVF cycle across splits risks leakage and renders the performance claim uninterpretable.
  Authors: We agree that the original manuscript omitted these critical details on the evaluation protocol, which is a valid concern given the risk of data leakage in time-series embryo imaging. In the revised version, we will add a dedicated subsection detailing the train-test split ratios, patient-level stratification, k-fold cross-validation procedure, class balance, and the statistical tests used to compare against baselines. This will directly address the potential for leakage and make the 96.4% accuracy claim fully interpretable.
  Revision: yes
- Referee: [Methods] Methods section: no description is given of how the 704 videos were acquired (number of patients, embryos per patient, imaging protocol, or exact daily sampling), nor of the baseline methods, their hyper-parameters, or the statistical tests used to assert superiority. These omissions make it impossible to assess whether the reported gains are reproducible or clinically meaningful.
  Authors: We acknowledge that the Methods section was insufficiently detailed regarding dataset acquisition and the implementation of baselines. We will expand this section in the revision to describe the acquisition process (including patient and embryo counts, imaging protocol, and daily sampling), provide full descriptions of the baseline methods along with their hyper-parameters, and specify the statistical tests employed. These additions will support reproducibility and allow readers to better evaluate the clinical relevance of the reported improvements.
  Revision: yes
Circularity Check
Standard supervised ML pipeline with no circular derivation
The paper describes a conventional supervised learning setup: DINOv2 extracts image features from time-lapse embryo frames, these features are fed into an LSTM with multi-head attention for temporal modeling, the network is trained on labeled videos, and accuracy is measured on held-out test data. No load-bearing step reduces by construction to its own inputs, no fitted parameter is relabeled as a prediction, and no self-citation chain is invoked to justify the architecture or results. The reported 96.4% accuracy is an empirical evaluation metric, not a tautological consequence of the model definition.