Recognition: unknown
Domain Fine-Tuning FinBERT on Finnish Histopathological Reports: Train-Time Signals and Downstream Correlations
Pith reviewed 2026-05-10 11:29 UTC · model grok-4.3
The pith
Fine-tuning FinBERT on Finnish medical text shifts the geometry of its embeddings; the authors attempt to measure whether these shifts correlate with downstream performance, as a potential early signal of domain-adaptation benefit.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We report on our attempts to predict the benefit of domain-specific pre-training of Finnish BERT from observing the geometry of embedding changes due to domain fine-tuning.
Load-bearing premise
That observable changes in embedding geometry during domain fine-tuning on unlabeled data will correlate with, and therefore predict, performance gains on downstream labeled classification tasks.
read the original abstract
In NLP classification tasks where little labeled data exists, domain fine-tuning of transformer models on unlabeled data is an established approach. In this paper we have two aims. (1) We describe our observations from fine-tuning the Finnish BERT model on Finnish medical text data. (2) We report on our attempts to predict the benefit of domain-specific pre-training of Finnish BERT from observing the geometry of embedding changes due to domain fine-tuning. Our driving motivation is the common situation in healthcare AI where we might experience long delays in acquiring datasets, especially with respect to labels.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports observations from domain fine-tuning of Finnish BERT (FinBERT) on unlabeled Finnish histopathological reports. It then attempts to use changes in the geometry of the embedding space (e.g., shifts in pairwise distances, variance, or alignment) induced by this fine-tuning to predict performance improvements on downstream labeled classification tasks. The motivation is to enable forecasting of domain adaptation benefits in settings where labeled data acquisition is delayed, such as healthcare AI.
Significance. If validated, the approach could provide a practical way to assess the value of domain-specific pre-training using only unlabeled data, which is particularly useful in low-resource or label-scarce domains like medical NLP. The work highlights potential train-time signals but currently lacks the quantitative evidence needed to establish predictive utility.
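As a concrete illustration of the train-time signals described above, the sketch below computes two of the named geometry metrics for the same documents embedded by the base and the domain-fine-tuned model: the mean shift in pairwise cosine distances and the ratio of total embedding variance. This is a minimal reading under our own assumptions; the paper's exact metric definitions are not reproduced here, and all names (geometry_shift, emb_base, emb_tuned) are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist

def geometry_shift(emb_base: np.ndarray, emb_tuned: np.ndarray) -> dict:
    """Compare embedding geometry before and after domain fine-tuning.

    emb_base, emb_tuned: (n_docs, dim) arrays with rows aligned by document.
    """
    # Shift in pairwise structure: mean absolute change in cosine distance
    # over all document pairs.
    d_base = pdist(emb_base, metric="cosine")
    d_tuned = pdist(emb_tuned, metric="cosine")
    pairwise_shift = float(np.mean(np.abs(d_tuned - d_base)))

    # Change in spread: total variance of each embedding cloud.
    var_base = float(emb_base.var(axis=0).sum())
    var_tuned = float(emb_tuned.var(axis=0).sum())

    return {
        "pairwise_cosine_shift": pairwise_shift,
        "variance_ratio": var_tuned / var_base,
    }
```

Both quantities need only unlabeled text, which is what makes them candidates for the label-free forecasting use-case the paper motivates.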
major comments (2)
- [Abstract and Results] The abstract and results describe observations and correlation attempts, but no specific methods, statistical controls, regression models, or quantitative metrics (such as correlation coefficients, p-values, or cross-validation scores) are reported for the prediction of downstream gains from geometry changes. This makes it impossible to evaluate whether the claimed correlations are robust or due to post-hoc selection.
- [Prediction step] The central claim that embedding geometry metrics can predict downstream performance requires a demonstrated quantitative link. Without comparison to null models, baseline predictors, or validation on held-out tasks, the predictive power remains unestablished, which is load-bearing for the motivating use-case of forecasting without labels.
minor comments (2)
- [Notation] Clarify the exact definitions of the geometry metrics used (e.g., what is meant by 'alignment metrics' or 'variance' in the embedding space); one concrete candidate definition is sketched after this list.
- [Figures] Ensure all figures showing embedding changes include axis labels, legends, and statistical annotations if applicable.
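On the [Notation] point, one possible concrete definition of an 'alignment metric' (our assumption, not necessarily the authors' choice) is the residual left after the best orthogonal Procrustes map from base to fine-tuned embeddings:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def procrustes_residual(emb_base: np.ndarray, emb_tuned: np.ndarray) -> float:
    """Relative error after the best rotation/reflection of the base embedding
    cloud onto the fine-tuned one; near 0 means the change is a pure rotation,
    larger values mean genuine geometric change."""
    a = emb_base - emb_base.mean(axis=0)   # center both clouds so the map is
    b = emb_tuned - emb_tuned.mean(axis=0) # not spent absorbing a mean shift
    r, _ = orthogonal_procrustes(a, b)     # orthogonal R minimizing ||aR - b||_F
    return float(np.linalg.norm(a @ r - b) / np.linalg.norm(b))
```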
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our exploratory study of domain fine-tuning signals in Finnish histopathological text. The comments correctly identify areas where the original submission was insufficiently quantitative, and we have revised the manuscript to incorporate additional statistical details and controls as described below.
read point-by-point responses
- Referee: [Abstract and Results] The abstract and results describe observations and correlation attempts, but no specific methods, statistical controls, regression models, or quantitative metrics (such as correlation coefficients, p-values, or cross-validation scores) are reported for the prediction of downstream gains from geometry changes. This makes it impossible to evaluate whether the claimed correlations are robust or due to post-hoc selection.
Authors: We agree that the original abstract and results sections lacked sufficient methodological transparency. The work was framed as an initial exploration of observable patterns rather than a validated prediction system. In the revised manuscript we have expanded both the abstract and the Results section to explicitly describe the geometry metrics computed (pairwise cosine distances, embedding variance, and Procrustes alignment), the downstream performance deltas, and the statistical procedures used. We now report Pearson and Spearman correlation coefficients, associated p-values, and a simple linear regression linking geometry shifts to F1 improvements, with the full set of tasks included to avoid post-hoc metric selection.
revision: yes
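A minimal sketch of the statistics this response describes, assuming one geometry-shift score and one downstream F1 delta per task; the variable names and the synthetic stand-in values are ours, not the authors'.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, linregress

def geometry_vs_gain(shift: np.ndarray, f1_delta: np.ndarray) -> dict:
    """Correlate geometry shifts with downstream F1 improvements across tasks."""
    r, p_r = pearsonr(shift, f1_delta)
    rho, p_rho = spearmanr(shift, f1_delta)
    fit = linregress(shift, f1_delta)  # simple linear model: f1_delta ~ a*shift + b
    return {
        "pearson_r": r, "pearson_p": p_r,
        "spearman_rho": rho, "spearman_p": p_rho,
        "slope": fit.slope, "intercept": fit.intercept,
    }

# Usage with clearly synthetic stand-in values for five downstream tasks.
rng = np.random.default_rng(0)
shift = rng.uniform(0.0, 1.0, size=5)
f1_delta = 0.1 * shift + rng.normal(0.0, 0.02, size=5)
print(geometry_vs_gain(shift, f1_delta))
```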
- Referee: [Prediction step] The central claim that embedding geometry metrics can predict downstream performance requires a demonstrated quantitative link. Without comparison to null models, baseline predictors, or validation on held-out tasks, the predictive power remains unestablished, which is load-bearing for the motivating use-case of forecasting without labels.
Authors: The manuscript presents observed correlations as a potential early signal rather than a proven forecasting method. To strengthen the quantitative link, the revision now includes (i) a null-model baseline obtained by randomly permuting the geometry metrics across tasks and recomputing correlations, (ii) comparison against a simple baseline predictor using only the quantity of unlabeled fine-tuning data, and (iii) leave-one-task-out validation across the downstream classification tasks. These additions are reported with the corresponding correlation coefficients and p-values so that readers can assess the incremental value of the geometry-based signals.
revision: yes
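The two controls named in this response could look as follows; again a sketch under the same assumptions as the previous snippet (one scalar geometry score and one F1 delta per task), with function names of our own choosing.

```python
import numpy as np
from scipy.stats import pearsonr, linregress

def permutation_null(shift, f1_delta, n_perm=10_000, seed=0):
    """Permutation p-value: fraction of shuffled task pairings whose |r|
    matches or exceeds the observed correlation."""
    rng = np.random.default_rng(seed)
    r_obs, _ = pearsonr(shift, f1_delta)
    hits = sum(
        abs(pearsonr(rng.permutation(shift), f1_delta)[0]) >= abs(r_obs)
        for _ in range(n_perm)
    )
    return r_obs, (hits + 1) / (n_perm + 1)  # add-one smoothing

def leave_one_task_out(shift, f1_delta):
    """Predict each task's F1 delta from a line fit on the remaining tasks."""
    shift, f1_delta = np.asarray(shift), np.asarray(f1_delta)
    preds = []
    for i in range(len(shift)):
        mask = np.arange(len(shift)) != i
        fit = linregress(shift[mask], f1_delta[mask])
        preds.append(fit.slope * shift[i] + fit.intercept)
    preds = np.array(preds)
    return preds, float(np.mean(np.abs(preds - f1_delta)))
```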
Circularity Check
No significant circularity; purely observational with empirical correlation attempts
full rationale
The paper reports observations from domain fine-tuning of Finnish BERT on unlabeled histopathological text and describes attempts to correlate observable embedding geometry changes (e.g., pairwise distances or variance shifts) with downstream labeled-task gains. No derivation, equation, or first-principles claim reduces to its own inputs by construction. The 'prediction' component consists of empirical attempts at correlation rather than a fitted parameter renamed as a prediction or a self-definitional loop. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes are invoked. The work is self-contained as an exploratory observational study and does not manufacture a closed predictive result from the same data used to define the signal.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 (2018)
- [2] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
- [3] Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
- [4] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
- [5] Rogers, A., Kovaleva, O., Rumshisky, A.: A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics 8, 842–866 (2021)
- [6] Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., Smith, N.A.: Don't stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964 (2020)
- [7] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., Amodei, D.: Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)
- [8] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D.d.L., Hendricks, L.A., Welbl, J., Clark, A., et al.: Training compute-optimal large language models. arXiv preprint arXiv:2203.15556 (2022)
- [9] Brandfonbrener, D., Anand, N., Vyas, N., Malach, E., Kakade, S.: Loss-to-loss prediction: Scaling laws for all datasets. arXiv preprint arXiv:2411.12925 (2024)
- [10] Myllylä, E., Siirtola, P., Isosalo, A., Reponen, J., Tamminen, S., Laatikainen, O.: Extracting information from unstructured medical reports written in minority languages: A case study of Finnish. Data 10(7), 104 (2025). https://doi.org/10.3390/data10070104
- [11] Bani Issa, W., Al Akour, I., Ibrahim, A., Almarzouqi, A., Abbas, S., Hisham, F., Griffiths, J.: Privacy, confidentiality, security and patient safety concerns about electronic health records. International Nursing Review 67(2), 218–230 (2020)
- [12] Keshta, I., Odeh, A.: Security and privacy of electronic health records: Concerns and challenges. Egyptian Informatics Journal 22(2), 177–183 (2021)
- [13] Virtanen, A., Kanerva, J., Ilo, R., Luoma, J., Luotolahti, J., Salakoski, T., Ginter, F., Pyysalo, S.: Multilingual is not enough: BERT for Finnish. arXiv preprint arXiv:1912.07076 (2019)
- [14] Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2020)
- [15] Alsentzer, E., Murphy, J., Boag, W., Weng, W.-H., Jindi, D., Naumann, T., McDermott, M.: Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop, pp. 72–78 (2019)
- [16] Türkmen, H., Dikenelli, O., Eraslan, C., Çallı, M.C., Özbek, S.S.: BioBERTurk: Exploring Turkish biomedical language model development strategies in low-resource setting. Journal of Healthcare Informatics Research 7(4), 433–446 (2023). https://doi.org/10.1007/s41666-023-00140-7
- [17] Nunes, M., Boné, J., Ferreira, J.C., Chaves, P., Elvas, L.B.: Medialbertina: An European Portuguese medical language model. Computers in Biology and Medicine 182, 109233 (2024). https://doi.org/10.1016/j.compbiomed.2024.109233
- [18] Bui, N., Nguyen, G., Nguyen, N., Vo, B., Vo, L., Huynh, T., Tang, A., Tran, V.N., Huynh, T., Nguyen, H.Q., Dinh, M.: Fine-tuning large language models for improved health communication in low-resource languages. Computer Methods and Programs in Biomedicine 263, 108655 (2025). https://doi.org/10.1016/j.cmpb.2025.108655
- [19] Warner, B., Chaffin, A., Clavié, B., Weller, O., Hallström, O., Taghadouini, S., Gallagher, A., Biswas, R., Ladhak, F., Aarsen, T., et al.: Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2025)
- [20] Prince, S.J.: Understanding Deep Learning. MIT Press (2023)
- [21] Luotolahti, J., Kanerva, J., Laippala, V., Pyysalo, S., Ginter, F.: Towards universal web parsebanks. In: Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), pp. 211–220 (2015)
- [22] Eilertsen, G., Jönsson, D., Ropinski, T., Unger, J., Ynnerman, A.: Classifying the classifier: dissecting the weight space of neural networks. arXiv preprint arXiv:2002.05688 (2020)
- [23] Raghu, M., Gilmer, J., Yosinski, J., Sohl-Dickstein, J.: SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in Neural Information Processing Systems 30 (2017)
- [24] Morcos, A., Raghu, M., Bengio, S.: Insights on representational similarity in neural networks with canonical correlation. Advances in Neural Information Processing Systems 31 (2018)
- [25] Ding, F., Denain, J.-S., Steinhardt, J.: Grounding representation similarity through statistical testing. Advances in Neural Information Processing Systems 34, 1556–1568 (2021)
- [26] Wu, J., Saha, S., Bo, Y., Khosla, M.: Measuring the measures: Discriminative capacity of representational similarity metrics across model families. arXiv preprint arXiv:2509.04622 (2025)
- [27] Bo, Y., Soni, A., Srivastava, S., Khosla, M.: Evaluating representational similarity measures from the lens of functional correspondence. arXiv preprint arXiv:2411.14633 (2024)
- [28] Williams, A.H., Kunz, E., Kornblith, S., Linderman, S.: Generalized shape metrics on neural representations. Advances in Neural Information Processing Systems 34, 4738–4750 (2021)
- [29] Kornblith, S., Norouzi, M., Lee, H., Hinton, G.: Similarity of neural network representations revisited. In: International Conference on Machine Learning, pp. 3519–3529 (2019). PMLR
- [30] Maheswaranathan, N., Williams, A., Golub, M., Ganguli, S., Sussillo, D.: Universality and individuality in neural dynamics across large populations of recurrent networks. Advances in Neural Information Processing Systems 32 (2019)
- [31] Phang, J., Liu, H., Bowman, S.: Fine-tuned transformers show clusters of similar representations across layers. In: Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pp. 529–538 (2021)
- [32] Smith, S.L., Turban, D.H., Hamblin, S., Hammerla, N.Y.: Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859 (2017)
- [33] Haxby, J.V., Guntupalli, J.S., Connolly, A.C., Halchenko, Y.O., Conroy, B.R., Gobbini, M.I., Hanke, M., Ramadge, P.J.: A common, high-dimensional model of the representational space in human ventral temporal cortex. Neuron 72(2), 404–416 (2011)
- [34] Dwivedi, K., Roig, G.: Representation similarity analysis for efficient task taxonomy & transfer learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12387–12396 (2019)
- [35] Kriegeskorte, N., Mur, M., Bandettini, P.A.: Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience 2, 249 (2008)
- [36] Mu, J., Bhat, S., Viswanath, P.: All-but-the-top: Simple and effective postprocessing for word representations. arXiv preprint arXiv:1702.01417 (2017)
- [37] Roy, O., Vetterli, M.: The effective rank: A measure of effective dimensionality. In: 2007 15th European Signal Processing Conference, pp. 606–610 (2007). IEEE
- [38] Rudman, W., Gillman, N., Rayne, T., Eickhoff, C.: IsoScore: Measuring the uniformity of embedding space utilization. In: Findings of the Association for Computational Linguistics: ACL 2022, pp. 3325–3339 (2022)
- [39] Strehl, A., Ghosh, J.: Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research 3(Dec), 583–617 (2002)
- [40] Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53–65 (1987)
- [41] Hubert, L., Arabie, P.: Comparing partitions. Journal of Classification 2(1), 193–218 (1985)
- [42] Nagatsuka, K., Broni-Bediako, C., Atsumi, M.: Length-based curriculum learning for efficient pre-training of language models. New Generation Computing 41(1), 109–134 (2023)
- [43] MacKay, D.J.: Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge, UK (2003)
A Details on our datasets
1. Histopathological reports: This was a dataset provided by the Central Finland Biobank, containing medical descriptions of patients who had ended up getting histopathological studies. Labels were not readily ava...